Abstract

This  research  uses  a  Bayesian  framework  to  develop  probability  densities  for  target 
detection  system  performance  metrics.  The  metrics  include  the  receiver  operating 
characteristic  (ROC)  curve  and  the  confidence  error  generation  (CEG)  curve.  The  ROC 
curve  is  a  discrimination  metric  that  quantifies  how  well  a  detection  system  separates 
targets  and  non-targets,  and  the  CEG  curve  indicates  how  well  the  detection  system 
estimates  its  own  confidence.  The  degree  of  uncertainty  in  these  metrics  is  a  concern  that 
previous  research  has  not  adequately  addressed.  This  research  formulates  probability 
densities  of  the  metrics  and  characterizes  their  uncertainty  using  confidence  bands. 
Additional  statistics  are  obtained  that  verify  the  accuracy  of  the  confidence  bands. 
Methods  for  the  generation  and  characterization  of  the  probability  densities  of  the 
metrics  are  specified  and  demonstrated,  where  the  initial  analysis  employs  beta  densities 
to  model  target  and  non-target  samples  of  detection  system  output.  For  given  target  and 
non-target  data,  given  functional  forms  of  the  data  densities  (such  as  beta  density  forms), 
and  given  prior  densities  of  the  form  parameters,  the  methods  developed  here  provide 
exact  performance  metric  probability  densities.  Computational  results  compare 
favorably  with  existing  approaches  in  cases  where  they  can  be  applied;  in  other  cases  the 
methods developed here produce results that existing approaches cannot address.

UNCERTAINTY ESTIMATION FOR TARGET DETECTION SYSTEM
DISCRIMINATION  AND  CONFIDENCE  PERFORMANCE  METRICS 


1.  Introduction 

This  chapter  introduces  target  detection  systems  and  metrics  that  characterize  their 
performance,  reviews  existing  research  on  these  metrics,  summarizes  the  contributions  of 
this  research,  and  presents  the  dissertation  organization. 

1.1  Target  detection  systems 

Decision  systems  accept  input  data  and  generate  decision  output(s).  Examples  arise  in 
artificial  intelligence,  speech  processing  systems,  medical  diagnostic  systems,  and  target 
detection  systems.  Typically,  decision  systems  make  estimates  of  decision  suitability,  but 
do  not  declare  unequivocally  that  a  particular  output  or  action  is  proper  (see 
[Ross  and  Minardi,  2004]).  A  target  detection  system  under  test  (SUT),  the  decision 
system  of  interest  in  this  research,  attempts  to  estimate  the  probability  that  given  input(s) 
contain  a  target.  The  inputs  are  often  images,  e.g.,  from  synthetic  aperture  radar  (SAR), 
although  the  results  of  this  research  extend  to  other  types  of  inputs.  Tanks,  improvised 
explosive  devices  (IEDs),  and  vehicles  containing  explosives  are  some  examples  of 
targets. 

Estimates  of  target  probability  are  referred  to  as  scores  (see  [Wise  et  al.,  2004]).  By 
selecting  a  threshold  score  value,  all  scores  greater  than  a  certain  value  may  be  declared 
targets  and  all  scores  less  than  the  selected  value  may  be  declared  non-targets.  Thus,  the 
estimates  of  probabilities  may  be  transferred  from  a  continuous  domain  to  a  hard  yes/no 
binary  decision  on  whether  or  not  the  input(s)  contains  a  target.  To  understand  the 
usefulness of score values and the related varying thresholds, consider two scenarios,
labeled  A  and  B. 


In  scenario  A,  SUT  A  attempts  to  detect  a  vehicle  containing  explosives  that  is  a 
significant  distance  (two  miles)  away  from  a  military  checkpoint  and  to  label  this  vehicle 
as  "target".  The  outcome  of  a  target  declaration  for  this  scenario  is  the  raising  of  barriers 
and  the  temporary  isolation  of  the  vehicle  at  a  point  one-half  mile  away  from  the 
checkpoint  so  that  if,  indeed,  the  vehicle  contains  explosives,  it  will  not  impact  either  the 
checkpoint  or  other  nearby  vehicles.  Once  isolated,  a  more  robust  stationary  monitoring 
system  is  used  to  examine  the  vehicle.  Here,  a  threshold  which  results  in  a  declaration  of 
"target"  that  stops  vehicles  with  explosives  but  that  also  stops  many  vehicles  without 
explosives  may  be  acceptable.  A  vehicle  without  explosives  that  is  inadvertently 
declared  a  "target"  will  not  be  damaged,  but  will  be  delayed  momentarily.  For  this 
example,  a  threshold  that  often  generates  false  alarms  in  that  it  declares  a  vehicle  without 
explosives  a  "target"  is  appropriate. 

In  scenario  B,  SUT  B  monitors  vehicles  that  approach  a  business  district.  In  this  case,  it 
is  impractical  to  raise  barriers.  However,  a  weapon  that  destroys  the  vehicle  engine  is 
available.  If  a  declaration  of  "target"  is  made,  then  the  weapon  will  be  used,  otherwise 
additional  sensors  will  continue  to  monitor  the  vehicle.  It  is  easy  to  see  that  the  threshold 
selected  in  this  scenario  may  need  to  be  much  higher  than  the  threshold  of  scenario  A  so 
that  few  false  alarms  are  generated,  even  if  the  scores  provided  by  the  SUTs  are  identical. 
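
As a minimal sketch of the point made by the two scenarios, the following Python fragment applies two different thresholds to the same set of scores. The score values, truth labels, and threshold values (0.3 and 0.8) are hypothetical and chosen only for illustration; they do not come from any SUT discussed here.

```python
# Hypothetical scores from a SUT and the true state of each input (illustrative only).
scores = [0.15, 0.35, 0.40, 0.55, 0.62, 0.70, 0.85, 0.95]
is_target = [False, False, False, True, False, True, True, True]


def declare(scores, threshold):
    """Declare "target" for every score at or above the threshold."""
    return [s >= threshold for s in scores]


def count_outcomes(declared, is_target):
    """Return (correct detections, false alarms) for a set of declarations."""
    detections = sum(d and t for d, t in zip(declared, is_target))
    false_alarms = sum(d and not t for d, t in zip(declared, is_target))
    return detections, false_alarms


# Scenario A tolerates false alarms, so a low threshold is acceptable.
low = count_outcomes(declare(scores, 0.3), is_target)
# Scenario B cannot tolerate false alarms, so a high threshold is needed.
high = count_outcomes(declare(scores, 0.8), is_target)
print("threshold 0.3 -> detections, false alarms:", low)
print("threshold 0.8 -> detections, false alarms:", high)
```

With identical scores, the lower threshold finds every target in this toy data set at the cost of several false alarms, while the higher threshold removes the false alarms but misses some targets.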

1.2  Detection  system  performance  metrics 

There  are  two  desirable  properties  of  score  output  from  a  SUT.  The  first  property  is 
discrimination.  Discrimination  refers  to  the  ability  of  an  SUT  to  classify  target  events  as 
target  labels  and  non-target  events  as  non-target  labels.  This  capability  changes 
depending  on  the  selection  of  threshold.  For  a  selected  threshold,  the  probability  of 
improperly  declaring  a  non-target  event  as  a  target  label  is  referred  to  as  "false  alarm 
probability". Similarly, the probability of correctly declaring a target event as a target
label  will  be  referred  to  here  as  "correct  detection  probability".  The  purpose  of  the 
adjective  "correct"  here  is  to  emphasize  the  usage  of  "correct  detection  probability"  to 
describe  the  probability  of  correctly  declaring  a  target  to  be  a  target.  An  alternative  is  for 
"correct  detection  probability"  to  denote  the  probability  of  correctly  labeling  an  event 
regardless  of  whether  the  event  is  target  or  non-target;  this  alternative  is  not  used  here. 
Note  that  "false  alarm  probability"  and  "correct  detection  probability"  may  be  replaced 
by various synonymous descriptions; see the discussion in Section 2.2. The second
property  is  accuracy  or  relevance  and  refers  to  whether  or  not  the  estimates  of  probability 
that  are  provided  by  a  SUT  are  accurate.  Both  properties  are  important  for  SUTs; 
methods  that  assist  in  evaluation  of  the  performance  of  SUTs  with  regard  to  such 
properties  are  now  introduced. 

If  the  behavior  of  an  SUT  over  a  varying  threshold  is  known,  then  the  discrimination 
property  can  be  described  by  a  plot  of  correct  detection  probability  versus  false  alarm 
probability.  This  plot  is  called  a  receiver  operating  characteristic  (ROC)  curve.  For 
example,  consider  signals  sent  from  a  transmitter  to  a  receiver.  The  receiver  attempts  to 
distinguish  "signal"  from  "noise".  The  receiver  does  not  know  for  a  selected  time  sample 
whether  or  not  a  signal  has  been  sent  but  does  measure  the  amplitude  of  a  demodulated 
signal  at  that  time.  The  receiver  must  choose  some  threshold  value  (e.g.  0.3,  0.5,  0.9, 
etc.),  to  declare  signal;  all  values  greater  than  the  threshold  are  declared  signal  and  all 
values  less  than  the  threshold  are  declared  "non-signal".  For  a  particular  threshold,  there 
is  a  correct  detection  probability:  among  all  signals  sent,  correct  detection  probability  is 
the  percentage  declared  as  signal.  Similarly,  for  a  particular  threshold,  there  is  a  false 
alarm  probability:  among  all  non-signals  sent,  false  alarm  probability  is  the  percentage 
declared  as  signal.  A  particular  threshold  might  result  in  a  high  correct  detection 
probability  but  also  a  high  false  alarm  probability;  selection  of  a  different  threshold  might 
reduce  false  alarm  probability  but  also  reduce  correct  detection  probability. 
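
The receiver example can be sketched numerically. The Python fragment below estimates correct detection probability and false alarm probability at several thresholds from labeled samples and collects the resulting (x, y) pairs as an empirical ROC curve. The amplitude values are hypothetical, and the function names are introduced here only for illustration.

```python
def roc_point(signal, noise, threshold):
    """Empirical (false alarm, correct detection) fractions at one threshold."""
    y = sum(v >= threshold for v in signal) / len(signal)  # correct detection
    x = sum(v >= threshold for v in noise) / len(noise)    # false alarm
    return x, y


def empirical_roc(signal, noise, thresholds):
    """Sweep thresholds from high to low so the curve runs from (0, 0) to (1, 1)."""
    return [roc_point(signal, noise, t) for t in sorted(thresholds, reverse=True)]


# Hypothetical demodulated amplitudes (illustrative values only).
signal = [0.45, 0.60, 0.72, 0.80, 0.91]
noise = [0.10, 0.22, 0.35, 0.50, 0.65]

curve = empirical_roc(signal, noise, [0.0, 0.3, 0.5, 0.9, 1.1])
for x, y in curve:
    print(f"false alarm = {x:.1f}, correct detection = {y:.1f}")
```

Lowering the threshold moves the operating point up and to the right: both fractions increase together, which is the trade-off described above.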
The ROC curve described here is formed by varying a single threshold of score. Figure
1.1  shows  a  score-threshold  ROC  curve  and  its  generation  from  target  and  non-target 
probability  densities  of  score  (hereafter  "probability  density"  is  often  simply  "density"). 

It  is  possible  to  form  ROC  curves  that  use  multiple  thresholds  of  score,  e.g.,  target  is 
declared  between  two  thresholds,  and  non-target  is  declared  otherwise.  Such  ROC 
curves  may  be  generated  by  thresholding  the  likelihood  ratio,  which  is  the  ratio  of  target 
to  non-target  probability  density,  as  described  in  the  next  chapter. 

The  ROC  curve  is  useful  because  it  provides  a  tool  to  examine  the  trade-off  in  correct 
detection  probability  and  false  alarm  probability.  In  particular,  the  ROC  curve  assists  in 
understanding  the  relative  impact  of  accepting  a  higher  or  lower  false  alarm  probability. 

1.3  Discrimination  metrics  versus  confidence  metrics 

The  ROC  curve  quantifies  the  discrimination  capability  of  a  SUT;  the  accuracy  (or 
relevance)  of  estimates  of  target  probability  (such  estimates  are  referred  to  as  scores)  is  of 
parallel  importance  to  discrimination.  In  an  ideal  SUT,  the  estimates  are  without  error; 
that  is,  every  provided  score  is  an  accurate  indication  of  the  probability  of  obtaining  a 
target  given  the  score.  In  actual  SUTs,  estimates  of  probability  may  deviate  significantly 
from  actual  probability.  A  system  that  produces  an  estimate  of  probability  which  is  very 
accurate  is  one  that  maintains  a  high  degree  of  "confidence"  in  results.  Thus,  the  term 
"confidence"  is  used  to  describe  the  relation  of  an  actual  SUT  to  an  ideal  SUT  in 
accuracy (or relevance). Just as the ROC curve characterizes discrimination, the
performance  of  a  SUT  over  all  scores  can  be  characterized  by  a  plot  of  the  probability  of 
obtaining  a  target  given  a  particular  score  versus  score.  This  plot  is  called  a  confidence 
error  generation  (CEG)  curve. 
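
One simple way to picture the CEG curve is to bin the scores and, within each bin, compare the observed fraction of targets with the scores themselves. The Python sketch below does this; the binning approach, the bin count, and the sample data are assumptions made for illustration and are not the estimation method developed in this research.

```python
def binned_target_fraction(scores, is_target, n_bins=5):
    """For each score bin, return (mean score, observed fraction of targets).

    For an ideal SUT the two numbers in each pair would agree.
    Empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for s, t in zip(scores, is_target):
        index = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((s, t))
    points = []
    for members in bins:
        if members:
            mean_score = sum(s for s, _ in members) / len(members)
            fraction = sum(t for _, t in members) / len(members)
            points.append((mean_score, fraction))
    return points


# Hypothetical scores and truth labels (illustrative values only).
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
is_target = [False, False, False, True, False, True, False, True, True, True]

for mean_score, fraction in binned_target_fraction(scores, is_target):
    print(f"mean score {mean_score:.2f}: observed target fraction {fraction:.2f}")
```

With only two samples per bin, the observed fractions are very coarse, which illustrates why uncertainty estimates for this kind of curve matter when data are limited.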

Figure 1.1 Target and non-target densities and the ROC curve performance metric. The
ROC curve quantifies the tradeoff in performance between probability of
correct detection and probability of false alarm as a decision threshold is
changed. The ROC curve has correct detection probability y = 0 for false
alarm probability x = 0 and it has y = 1 for x = 1. In the left plot the solid
curve is the probability density of target, the dotted curve is the probability
density of non-target, and both densities are functions of score. To obtain a
score-threshold ROC curve, a threshold is swept across the domain of possible
scores from a SUT. For example, at a selected threshold score 0.57,
every score greater than 0.57 is regarded as target, and every score less than
0.57 is regarded as non-target. Increasing the threshold leads to a reduced
false alarm probability and also a reduced correct detection probability.

Both the ROC curve and CEG curve are useful tools for comparing SUTs, thereby
determining which SUTs are most appropriate for a particular purpose. Similarly, both
the ROC curve and CEG curve may be evaluated for a single SUT to determine whether
or not the SUT is appropriate for a particular purpose. A system that evaluates an SUT
through  the  use  of  performance  metrics  (such  as  ROC  curves  and  CEG  curves)  is  referred 
to  as  a  Performance  Evaluation  System.  Note  that  the  term  metric  here  refers  to  a 
description  or  characterization  of  performance  or  efficiency;  this  meaning  is  consistent 
with  recent  use  of  the  term  metric  for  software  development  [Thing,  2002]  and  is  also 
consistent  with  recent  use  of  the  term  metric  specific  to  detection  system  performance 
evaluation.  For  example,  the  objective  of  a  recent  workshop  sponsored  by  the  Defense 
Advanced  Research  Projects  Agency  (DARPA),  National  Institute  of  Standards  and 
Technology  (NIST),  and  the  Institute  of  Electrical  and  Electronics  Engineers  (IEEE),  was 
to  define  measures  and  methodologies  for  evaluating  the  performance  of  intelligent 
systems,  and  it  was  entitled  "Performance  Metrics  for  Intelligent  Systems  Workshop" 
[Messina and Meystel, 2004]. Mathematically, however, a metric is a real-valued
function defined on a pair of objects, with specific properties. We apply the informal usage
of this term; the entire ROC curve and CEG curve are single comparable descriptions of
the overall performance capability of a SUT. Note that the terms "measure"
[Ross and Minardi, 2004] and "quantifier" [Schubert et al., 2005] could also be
appropriate.

1.4  Evaluation  of  a  system  under  test 

Figure  1.2  shows  the  relation  of  the  SUT,  performance  evaluation  system,  performance 
metrics  (such  as  the  ROC  curve  and  CEG  curve),  test  image  inputs,  and  truth  data.  To 
appropriately  develop  the  ROC  curve  and  CEG  curve  and  thus  characterize  SUT 
usefulness  for  a  particular  purpose,  large  amounts  of  test  data  are  desired,  where  for  this 
data  the  true  state  (target  or  non-target)  of  the  output  scores  is  known.  However,  such 
large  amounts  of  test  data  are  typically  unavailable  or  are  costly  or  time-consuming  to 
obtain.  As  a  result,  the  ability  to  quantify  the  uncertainty  in  the  ROC  curve  and  CEG 
curve  performance  for  limited  sets  of  data  is  important.  If  such  uncertainty  estimates  are 
available,  then  the  range  of  possible  values  of  the  curves  given  large  amounts  of  data  is 
understood, and acceptable quantification of SUT performance may be possible with
limited sets of data. In some cases uncertainty estimates may lead to an informed
decision that more data is needed. In other cases, the decision may be that a SUT is
suitable for a particular task.

Figure 1.2 Evaluation of a system under test (SUT). The SUT receives a test image
and assigns it a score between 0 and 1. A score near one indicates high
probability that the test image contains a target, and a score approaching
zero indicates a low probability that the test image contains a target. Once
the SUT issues a set of scores, a performance evaluation system compares
the scores with truth data. The truth data indicates true state of target or
non-target in the test image, but does not refer to the entire test image.
Performance metrics such as the receiver operating characteristic (ROC) curve
and the confidence error generation (CEG) curve are then used to quantify
SUT performance.

In  contrast  to  current  methods  for  uncertainty  estimation,  the  methods  developed  here 
estimate  the  probability  density  of  a  ROC  curve  based  on  a  Bayesian  framework  that 
fully  incorporates  available  information.  The  Bayesian  development  incorporates,  by 
definition,  all  that  is  known  or  assumed  about  the  sample  score  data,  the  probability 
density  forms  for  target  and  non-target  scores,  and  the  prior  probability  densities  of  the 
parameters  in  these  forms.  For  a  given  set  of  target  samples  and  non-target  samples, 
assumed  sample  density  models,  and  prior  densities  of  parameters,  there  is  only  one 
probability  density  of  the  ROC  curve.  The  Bayesian  formalism  permits  the  generation  of 
this  unique  ROC  curve  probability  density;  descriptive  statistics  such  as  the  median  ROC 
curve  and  ROC  curve  confidence  intervals  may  then  be  developed,  if  desired,  from  this 
probability  density.  Non-Bayesian  methods  either  do  not  fully  account  for  what  is 
known  about  the  data  models  and  prior  densities  or  can  only  account  for  this  knowledge 
in  an  ad  hoc  manner.  The  Bayesian  probability  density  of  the  ROC  curve  is  a  full 
account  and  is  extended  in  this  research  to  uncertainty  estimation  of  the  CEG  curve.  The 
results  shown  here  demonstrate  improved  uncertainty  estimation  methods  for  the  ROC 
curve  and  initiate  uncertainty  estimation  methods  for  the  CEG  curve. 
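
In symbols, the Bayesian step described above can be sketched as follows. The notation is an assumption introduced here for illustration: $\phi$ denotes the parameters of the assumed target density form $g(s \mid \phi)$, $p(\phi)$ is the prior density of those parameters, and $s_1, \ldots, s_n$ are the observed target scores, assumed independent. Bayes' theorem then gives the posterior density of the parameters:

```latex
\[
p(\phi \mid s_1, \ldots, s_n)
  = \frac{p(\phi) \prod_{i=1}^{n} g(s_i \mid \phi)}
         {\int p(\phi') \prod_{i=1}^{n} g(s_i \mid \phi') \, d\phi'}
\]
```

An analogous expression holds for the parameters of the non-target density. Because each choice of target and non-target parameter values determines one ROC curve, these posterior densities induce a probability density on the ROC curve itself.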

1.5  Existing  research  on  performance  metric  uncertainty 

There  are  existing  methods  that  estimate  ROC  curve  uncertainty.  However,  these 
methods  typically  make  unacceptable  assumptions;  for  example,  "binormal"  methods 
assume  that  the  target  and  non-target  score  densities  are  either  normal  or  may  be  made 
normal  after  transformation  and  generally  assume  that  the  probability  of  obtaining  a 
target  increases  as  score  increases.  "Bootstrap"  methods  do  not  make  such  assumptions 
but are inaccurate for relatively small sample sizes. Still other methods, such as
"binomial"  methods,  may  be  suitable  for  estimating  the  uncertainty  in  correct  detection 
probability  and  false  alarm  probability  at  a  selected  threshold  but  are  not  appropriate  for 
estimates  of  the  ROC  curve  over  all  thresholds. 
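
As a hedged illustration of a binomial approach at a single threshold, the sketch below computes a Wilson score interval for correct detection probability from the number of target samples declared as target. The Wilson interval is used here only as one common binomial interval; it is an assumption of this sketch and not necessarily the interval used by the methods reviewed in Chapter 2. The counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return center - half_width, center + half_width


# Hypothetical example: 24 of 30 target samples score above the selected threshold.
low, high = wilson_interval(24, 30)
print(f"observed fraction 0.80, interval ({low:.3f}, {high:.3f})")
```

The interval describes uncertainty at the one selected threshold only; it does not by itself provide a band for the ROC curve over all thresholds.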

Figure  1.3  shows  a  comparison  of  confidence  band  results  obtained  by  the  method 
developed  here  to  the  most  prevalent  method  in  the  literature  [Metz  et  al.,  1998].  The 
solid  curve  in  the  figure  shows  the  true  ROC  curve,  which  is  deterministic  because  it  is 
generated  by  the  target  and  non-target  densities  from  which  the  score  samples  are  drawn. 
A  95%  confidence  band  based  on  an  observed  set  of  30  target  and  30  non-target  score 
samples  is  shown  for  the  method  developed  here.  The  Metz  method  (which  is  a  binormal 
approach)  produces  a  95%  confidence  band  that  is  wider  and  therefore  less  informative 
than  the  band  for  the  method  developed  here,  assuming  that  the  method  developed  here  is 
accurate  with  respect  to  the  assumed  density  forms  and  the  prior  densities  of  parameters. 
Chapter  3  considers  the  analytical  justification  for  the  method  developed  here,  and 
Chapters  4  and  5  demonstrate  its  accuracy.  Chapter  2  examines  the  Metz  method  and 
other  ROC  curve  confidence  interval  methods  in  detail.  The  method  developed  here 
performs  favorably  in  comparison  with  the  other  methods  (where  suitable  comparison  is 
possible).  More  importantly,  the  method  developed  here  shows  the  viability  of  a  flexible 
Bayesian  framework  and  enables  the  development  of  alternative  descriptive  statistics 
(such as initially considered in [Parker et al., 2005a, 2005b]). The method is directly
applicable  to  other  metrics,  such  as  the  CEG  curve  (the  CEG  curve  is  detailed  in  Section 
2.3;  see  also  [Parker  et  al.,  2005c]).  This  framework  permits  changes  in  model 
assumptions;  the  Metz  method  and  most  other  approaches  do  not  allow  such  changes. 

Figure 1.3 Comparison of the method developed here with the method of Metz
[Metz et al., 1998]. Here 30 target and 30 non-target samples are drawn
from beta densities with target mean 0.715, target standard deviation 0.01,
non-target mean 0.715, and non-target standard deviation 0.046; the solid
line is the true ROC curve. Note that the Metz ROC curve 95% confidence
band is extremely wide (and uninformative) compared to the 95% confidence
band obtained using the method developed here. The software package
"ROCKIT" is used to generate the confidence intervals for the method of
Metz.

1.6 Summary of contributions of this research

The research reported here uses a Bayesian framework to characterize the uncertainty of
target detection performance metrics. The result is an improved understanding and
quantification of ROC curve and CEG curve uncertainty for target detection applications.
The  framework  develops  ROC  or  CEG  curve  probability  densities,  which  completely 
describe  curve  uncertainty  for  given  samples  of  target  and  non-target  scores,  assumed 
density  forms  for  these  scores,  and  assumed  prior  densities  of  the  parameters  that  specify 
these  forms.  From  the  ROC  or  CEG  curve  densities,  a  transition  to  descriptive  statistics, 
such  as  median  curves  or  90%  confidence  intervals,  is  made.  The  framework  is  fully 
Bayesian  and  for  the  given  samples,  density  forms,  and  prior  parameter  densities  it 
provides  exact  performance  metric  probability  densities. 

The  framework  is  also  numerically  tractable,  and  the  calculated  ROC  and  CEG  curve 
densities  yield  substantial  improvements  over  existing  ROC  curve  uncertainty  estimation 
methods.  These  improvements  are  emphasized  qualitatively  in  the  identification  of 
fundamental  weaknesses  inherent  in  existing  ROC  curve  uncertainty  estimation  methods; 
in  addition,  quantitative  comparisons  are  made  which  verify  that  the  approach  developed 
here  compares  favorably  with  previous  approaches.  Further,  the  uncertainty  estimation 
process  is  shown  to  seamlessly  transition  to  the  CEG  curve,  a  metric  that  previously  has 
been  of  limited  use  due  to  a  lack  of  appropriate  methods  for  estimating  its  uncertainty, 
especially  for  limited  amounts  of  data.  From  the  framework  developed  here,  CEG  curve 
uncertainty  estimates  can  now  be  robustly  understood  and  obtained  even  for  low  numbers 
of  samples.  Thus,  for  the  CEG  curve,  the  research  presented  here  formulates  a  robust 
method  for  uncertainty  estimation  where  alternatives  do  not  exist;  for  the  ROC  curve  the 
research  presented  here  offers  a  significantly  improved  method  of  uncertainty  estimation 
for  which  the  alternatives  are  limited  by  inability  to  handle  low  numbers  of  samples 
and/or  by  restrictive  model  assumptions. 

1.7  Organization  of  this  dissertation 

Chapter  2  provides  background  on  the  uncertainty  estimation  problem  considered  here, 
provides  a  review  of  the  literature,  and  identifies  weaknesses  in  existing  uncertainty 
estimation methods for ROC and CEG curves. Chapter 3 describes and develops both
analytical  expressions  and  numerical  approximations  for  the  ROC  curve  probability 
density.  This  ROC  curve  density  is  then  used  in  Chapter  4  to  obtain  and  verify 
descriptive  statistics,  such  as  median  ROC  curves  and  90%  confidence  intervals.  Chapter 
5  provides  quantitative  comparisons  of  the  method  developed  here  to  previous  methods 
of  confidence  interval  estimation.  Chapter  6  summarizes  accomplishments  and 
contributions  and  identifies  areas  of  interest  for  future  work. 

2. Background


Performance  metrics  such  as  the  ROC  curve  quantify  the  capability  of  a  target  detection 
system  to  distinguish  between  target  and  non-target  inputs.  Other  performance  metrics 
such  as  the  CEG  curve  examine  the  relevance  of  the  detection  system  outputs.  This 
research  develops  improved  methods  for  estimating  the  uncertainty  of  these  metrics  and 
many  other  types  of  target  detection  performance  metrics.  Since  a  ROC  curve  for  one 
target  detection  system  under  test  (SUT)  may  be  compared  to  a  ROC  curve  for  a  second 
SUT,  the  ROC  curve  and  CEG  curves  are  referred  to  here  as  metrics,  although  the  curves 
are  not  scalar  values. 

As  discussed  in  the  introduction,  the  outputs  of  a  target  detection  SUT  are  typically 
estimates  of  the  probability  of  target.  Such  estimates  are  referred  to  as  posterior 
probability  estimates  and  are  critical  in  appropriate  decision  making  (see 
[Bishop,  1995]).  For  example,  the  speech  processing  community  often  makes  estimates 
through  the  use  of  cross-entropy  (see  relation  of  speech  processing  techniques  to  the 
target  detection  field  by  [Ross  and  Minardi,  2004]).  A  speech  processor  typically 
examines  a  portion  of  observed  input  speech  and  attempts  to  match  this  input  with 
plausible  phonetic  sounds.  The  processor  does  not  declare  with  absolute  certainty  that  a 
portion  of  observed  speech  is  a  particular  sound;  however,  it  estimates  the  probability  of 
a  sound  or  group  of  sounds.  Then,  when  groups  of  adjacent  input  speech  are  examined, 
the  estimates  of  probability  are  used  to  formulate  words,  phrases,  and  sentences.  Similar 
to  the  speech  processor  example,  a  SUT  does  not  declare  a  target  with  certainty  but 
instead  estimates  the  probability  that  given  input(s)  contain  a  target. 

Specific  to  the  focus  here  on  target  detection,  development  and  use  of  the  CEG  curve 
performance  metric  by  the  Sensors  Directorate  of  the  Air  Force  Research  Laboratory 
motivates  this  research  (see  [Ross  and  Minardi,  2004]  and  [Wise  et  al.,  2004])  in  that 
CEG  curve  uncertainty  was  not  well  characterized.  Thus,  the  methods  developed  here 
are first applied to the ROC curve and are then used to estimate CEG curve
uncertainty.  However,  existing  approaches  to  ROC  curve  estimation  are  inadequate, 
particularly  when  low  numbers  of  inputs  are  available  and  when  normality  assumptions 
are  invalid;  these  conditions  are  both  common  constraints  for  target  detection  systems. 
The  methods  developed  here  show  improved  results  compared  to  existing  ROC  curve 
uncertainty  estimation  approaches  and  are,  in  fact,  optimal  (see  Section  2.5  and  method 
development  in  Chapter  3).  The  results  of  this  research  also  benefit  the  wide  range  of 
fields  that  use  ROC  curves  (e.g.,  medical  decision  making,  machine  learning). 

2.1  Target  detection  systems  and  their  performance  evaluation 

Figure  1.2  in  Chapter  1  shows  the  relation  between  a  test  input,  the  SUT,  the  performance 
evaluation  system,  and  performance  metrics.  (Note  that  although  a  test  image  is  used  as 
an  example,  the  process  also  directly  applies  to  other  types  of  test  inputs.)  For  each  test 
image  the  SUT  outputs  a  score  between  zero  and  one.  This  score  provides  an  estimate  of 
the  probability  that  the  image  contains  a  target.  A  score  of  one  estimates  a  probability  of 
one  that  the  image  contains  at  least  one  target,  and  a  score  of  zero  estimates  a  probability 
of  one  that  the  image  does  not  contain  a  target.  The  performance  evaluation  system 
knows  truth  for  test  cases;  that  is,  whether  an  image  actually  contains  a  target  or  not.  The 
performance  evaluation  system  has  two  input  types,  the  scores  for  many  images  from  the 
SUT  and  the  truth  (target  or  non-target)  associated  with  each  image.  Performance 
metrics  such  as  the  ROC  curve  and  the  CEG  curve  are  outputs  of  the  performance 
evaluation  system.  The  area  under  the  ROC  curve  (AUC)  value  and  the  CEG  curve 
summary  metric  of  root  square  deviation  (RSD)  value  are  also  considered.  A  key 
distinction  is  that  the  ROC  curve  and  AUC  value  describe  how  well  a  system  is  able  to 
discriminate  between  target  and  non-target  without  regard  to  whether  or  not  the  scores 
are  accurate  estimates  of  the  probability  of  target,  whereas  the  CEG  curve  and  RSD  value 
are  metrics  that  describe  such  accuracy  (or  relevance)  [Ross  and  Minardi,  2004]. 
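
The AUC value can be estimated directly from samples. The sketch below relies on the well-known equivalence between the empirical AUC and the fraction of (target, non-target) score pairs in which the target score is larger, with ties counted as one half. The sample scores are hypothetical and the function name is introduced here only for illustration.

```python
def empirical_auc(target_scores, non_target_scores):
    """Fraction of (target, non-target) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for t in target_scores:
        for n in non_target_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(target_scores) * len(non_target_scores))


# Hypothetical scores (illustrative values only).
target_scores = [0.55, 0.70, 0.85, 0.95]
non_target_scores = [0.15, 0.35, 0.40, 0.62]

auc = empirical_auc(target_scores, non_target_scores)
print(f"empirical AUC = {auc:.4f}")
```

Note that this value depends only on the ordering of the scores, not on whether the scores are accurate probability estimates, which is the distinction drawn above between discrimination and confidence metrics.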

2.2 ROC curves and AUC values


The  ROC  curve  (see  [Lusted,  1971]  and  [Swets,  1988])  is  a  plot  of  probability  of  correct 
detection  versus  probability  of  false  alarm  based  on  a  varying  threshold  for  detection. 
Figure  1.1  of  Chapter  1  shows  such  a  plot;  this  figure  also  demonstrates  the  calculation  of 
probability  of  correct  detection  and  probability  of  false  alarm  for  a  single  selected 
threshold.  The  ROC  curve  quantifies  the  trade-off  in  performance  between  correct 
detection  probability  (y)  and  false  alarm  probability  (x)  as  a  decision  threshold  (t)  is 
changed  (see  [Alsing,  2000]).  The  ROC  curve  derives  its  name,  receiver  operating 
characteristic,  from  its  original  application,  which  focused  on  radio  applications 
[Wickens,  2002].  Beginning  with  its  original  application  in  the  1950s,  it  has  been  used  in 
many  other  applications,  such  as  the  target  detection  performance  metric  that  is  the  focus 
of  this  research,  medical  decision  making  (e.g.  quantifying  the  probability  of  a  disease 
occurring  given  a  biological  marker;  see  [Hanley,  1999]),  and  machine  learning  (see 
[Macskassy  and  Provost,  2004]). 

Three  formal  definitions  related  to  the  ROC  curve  are  as  follows. 

(1) Let $E$ be the population set of test images, where the test images either contain a
target (target images) or do not contain a target (non-target images). Based on an
estimate of whether each image $e \in E$ actually has a target, an SUT produces a data
score $d$, where $d \in D = [0,1]$. Thus, the SUT maps $E$ to $D$, denoted by
$E \xrightarrow{\mathrm{SUT}} D$. Let $\Theta = [0,1]$. For each $\theta \in \Theta$, let $a_\theta$ be a classifier
mapping $D$ into a label set $L$, denoted by $D \xrightarrow{a_\theta} L$, where
$L = \{\text{target declaration}, \text{non-target declaration}\}$. Thus, the classifier
system is $E \xrightarrow{\mathrm{SUT}} D \xrightarrow{a_\theta} L$. For any element $e \in E$, $d \in D$, and $l \in L$,
choice of $\theta$ specifies the classifier, and Equation (2.1) specifies the label for the
score-threshold method:

\[
l =
\begin{cases}
\text{target declaration}, & d \ge \theta \\
\text{non-target declaration}, & d < \theta
\end{cases}
\tag{2.1}
\]
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The  threshold  for  detection  is  t,  where  t  is  a  specified  9.  The  above  is  adapted  from 
Schubert,  Oxley,  Bauer  [Schubert  et  al.,  2005],  who  provide  a  similar  classifier  definition 
but  with  application  to  a  more  general  classifier  system,  rather  than  the  score-threshold 
application  of  interest  here. 
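
The score-threshold rule of Equation (2.1) can be illustrated with a minimal sketch. The function name `classify` and the example scores are illustrative assumptions and do not come from the cited work; the threshold of 0.53 is borrowed from the example threshold in the caption of Figure 2.1.

```python
def classify(score, theta):
    """Apply the score-threshold rule of Equation (2.1) to one score d in D = [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in D = [0, 1]")
    # A score at or above the threshold is declared a target.
    return "target declaration" if score >= theta else "non-target declaration"


# Hypothetical scores evaluated at a threshold of 0.53
labels = [classify(d, 0.53) for d in (0.10, 0.53, 0.90)]
print(labels)
```

Note that a score exactly equal to the threshold is declared a target, because Equation (2.1) uses a non-strict inequality.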

(2) Let $E_{target}$ be the subset of all $e \in E$ that contain target images. Let
$D_{target} \subset D$ be the subset of all $d \in D$ corresponding with $E_{target}$. Let
$s \in (-\infty, \infty)$. Let $g(s)$ be the target score probability density formed by all
$D_{target}$, where $s$ is a scalar random variable. The correct detection probability is

$$\int_t^{\infty} g(s)\,ds. \tag{2.2}$$

(3) Let $E_{non\text{-}target}$ be the subset of all $e \in E$ that contain non-target images. Let
$D_{non\text{-}target} \subset D$ be the subset of all $d \in D$ for $E_{non\text{-}target}$. Let
$s \in (-\infty, \infty)$. Let $f(s)$ be the non-target score probability density formed by all
$D_{non\text{-}target}$. Specify $t \in (-\infty, \infty)$. For the score-threshold method described by
Equation (2.1), let $t = \theta$. The false alarm probability is

$$\int_t^{\infty} f(s)\,ds. \tag{2.3}$$

Typically  a  threshold  for  detection  (or  simply,  threshold)  is  applied  either  to  score  or 
likelihood  ratio,  where  the  likelihood  ratio  is  the  target  probability  density  divided  by  the 
non-target  probability  density.  The  threshold  of  interest  here  and  described  in  the  above 
definitions  is  score-threshold  (as  described  in  Equation  (2.1)),  because  the  primary 
objective  is  to  use  ROC  curves  and  AUC  values  (and  other  performance  metrics)  to 
quantify  whether  a  SUT  is  performing  optimally,  rather  than  to  use  the  ROC  curves  and 
AUC  values  to  optimize  SUT  performance.  If  the  threshold  for  detection  is  set  at  zero 
(i.e.,  all  score  values  are  declared  as  targets),  100%  of  targets  are  detected,  but  this  choice 
also  results  in  a  probability  of  false  alarm  equal  to  one.  If  the  threshold  for  detection  is 
set at one, no false alarms occur, but the probability of correct detection is zero. An ideal
ROC curve has a correct detection probability that equals one for all false alarm
probabilities greater than zero. Thus, an ideal ROC curve has an AUC value that equals
one,  whereas  a  non-discriminating  ROC  curve  has  an  AUC  value  that  equals  0.5.  The 
AUC  value  is  the  integral  from  0  to  1  of  correct  detection  probability  y  as  a  function  of 
false  alarm  probability  x.  The  ROC  curve  is  the  set 

$\{(x,y) \in [0,1] \times [0,1] \mid y = r(x)\ \forall x \in [0,1]\}$. If $r$ is the function that
generates the ROC curve, so that $y = r(x)$, then

$$\mathrm{AUC}(r) = \int_0^1 r(x)\,dx. \tag{2.4}$$
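
As an illustration of Equation (2.4), the following minimal sketch forms an empirical score-threshold ROC curve from a finite set of scores and integrates it with the trapezoidal rule. The function names and the example scores are hypothetical. Replacing the densities with sample fractions is an assumption made only for illustration; it is not the estimation method developed in this research.

```python
def roc_points(target_scores, nontarget_scores):
    """Empirical (false alarm, correct detection) pairs for the rule d >= t."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores) | {0.0}, reverse=True)
    points = [(0.0, 0.0)]  # a threshold above every score declares nothing
    for t in thresholds:
        x = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        y = sum(s >= t for s in target_scores) / len(target_scores)
        points.append((x, y))
    return points


def auc(points):
    """Trapezoidal approximation of the integral of y = r(x) over [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores: four target images and four non-target images
target_scores = [0.9, 0.8, 0.6, 0.4]
nontarget_scores = [0.7, 0.5, 0.3, 0.1]
points = roc_points(target_scores, nontarget_scores)
print(auc(points))
```

For these scores, 13 of the 16 target and non-target pairs are ordered correctly, so the empirical AUC is 13/16.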


The  research  here  focuses  on  this  score-threshold  ROC  curve.  However,  an  alternative 
method,  which  is  not  desirable  for  comparison  of  multiple  SUTs  by  an  evaluator 
(assuming  that  the  evaluator  only  has  access  to  scores  provided  by  the  SUTs),  but  that 
can  be  a  desirable  tool  for  SUT  improvements,  uses  maximum  likelihood  (via  the 
Neyman-Pearson  Lemma;  see  [Scharf,  1991])  to  develop  the  ROC  curve.  A 
likelihood-ratio-threshold  ROC  curve  (see  [VanTrees,  1968]  and  [Scharf,  1991]),  is 
generated  by  thresholding  the  ratio  of  the  target  and  non-target  densities,  and  is 
consequently  convex  (this  curve  has  a  negative  second  derivative  at  each  false  alarm 
probability).  A  likelihood-ratio-threshold  ROC  curve  allows  multiple  positive  (i.e., 
target)  decision  regions  across  the  range  of  possible  score  values,  whereas  a 
score-threshold  ROC  curve  allows  only  one  positive  decision  region  (see 
[VanTrees,  1968],  [Shanmugan  and  Breipohl,  1988],  [Barkat,  1991],  and  [Scharf,  1991]). 
Figure  2.1  compares  the  procedures  for  generating  a  score-threshold  ROC  curve  and  a 
likelihood-based  ROC  curve.  The  score-threshold  ROC  curve  always  has  an  AUC  value 
equal  to  or  less  than  the  likelihood-ratio-threshold  ROC  curve,  assuming  that  the 
likelihoods  are  accurately  known  when  designing  the  detection  system.  Note  that  while 
the target and non-target densities are of beta density form in the example used in the
figure, this property holds for any probability density (e.g., Gaussian, beta, mixture of
beta, etc.).

To  understand  the  rationale  for  using  score  threshold,  consider  a  system  under  test  that 
declares  a  score  of  "0"  for  all  targets  and  a  score  of  "1"  for  all  non-targets.  Since  the 
scores  provided  by  a  SUT  are  estimates  of  the  probability  that  the  evaluated  image  is  a 
target,  this  performance  is  obviously  poor.  The  corresponding  score-threshold  ROC 
curve has an AUC of zero, affirming that the system is performing poorly. In contrast, the
corresponding likelihood-ratio-threshold ROC curve has an AUC of one. Thus a
likelihood-ratio-threshold  ROC  curve  may  be  of  significant  interest  for  developing  a 
target  detection  system,  but  a  score-threshold  ROC  curve  is  most  relevant  to  the  objective 
of  evaluating  system  performance. 
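
The inverted-score example can be checked numerically with a short sketch. Two assumptions are made for illustration: the AUC is computed as the fraction of correctly ordered target and non-target pairs (ties counted as one half), which equals the trapezoidal area under the empirical ROC curve, and the likelihood ratio is estimated from the relative frequencies of each score value. All names are hypothetical.

```python
import math
from collections import Counter


def pairwise_auc(target_values, nontarget_values):
    """Fraction of (target, non-target) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for a in target_values:
        for b in nontarget_values:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(target_values) * len(nontarget_values))


def likelihood_ratio(score, target_scores, nontarget_scores):
    """Empirical ratio of target to non-target relative frequency at one score value."""
    p_target = Counter(target_scores)[score] / len(target_scores)
    p_nontarget = Counter(nontarget_scores)[score] / len(nontarget_scores)
    return math.inf if p_nontarget == 0.0 else p_target / p_nontarget


# The SUT assigns a score of 0 to every target and a score of 1 to every non-target
target_scores = [0.0] * 5
nontarget_scores = [1.0] * 5

score_auc = pairwise_auc(target_scores, nontarget_scores)
lr_auc = pairwise_auc(
    [likelihood_ratio(s, target_scores, nontarget_scores) for s in target_scores],
    [likelihood_ratio(s, target_scores, nontarget_scores) for s in nontarget_scores],
)
print(score_auc, lr_auc)
```

Thresholding the score gives an AUC of zero, while thresholding the likelihood ratio gives an AUC of one, because the likelihood ratio reverses the ordering of the scores.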

Figure  2.1  shows  deterministic  target  and  non-target  densities,  each  for  two  specified 
parameters  (see  Equation  (3.1))  and  compares  a  score-threshold  approach  with  a 
likelihood-ratio-threshold  approach.  Note  that  while  beta  densities  are  the  focus  of  these 
figures,  the  methods  developed  here  extend  to  other  density  forms  (see  Figure  3.4  and 
related  discussion  in  Section  3.1). 

Figure 2.1 Comparison of score-based and likelihood-based ROC curve generation. In
the score-based threshold approach (top figure), a probability of correct detection
is calculated by selecting a threshold for score (e.g., 0.53), and integrating
over the target density (solid curve) from that threshold to 1. Similarly,
a probability of false alarm is calculated by integrating the non-target
density (dotted curve) over the same domain. The values for probability of
correct detection and probability of false alarm form a point on the ROC
curve, and the ROC curve is formed by varying the threshold from 0 to 1. In
the likelihood-based approach (bottom figure) the likelihood ratio, which is
the ratio of target to non-target densities, is thresholded (e.g., at 1), so that in
general there is more than one correct detection and false alarm region.

A theorem that provides an analytical form for the ROC curve is as follows.

Theorem 2.1 Score-threshold ROC curve

Let $f(s; u)$ and $g(s; v)$ be densities of $s$ given parameters $u$ and $v$, where $s$ is a
real-valued random variable between zero and one, $s \in [0,1]$, $f(s; u)$ is the non-target
score probability density, $g(s; v)$ is the target score probability density, $u$ is a parameter
vector that specifies the non-target score density, and $v$ is a parameter vector that
specifies the target score density. Let $f$ and $g$ be integrable over $[0,1]$ for each $u$ and $v$,
and for each $t \in [0,1]$ define

$$F(t; u) = \int_t^1 f(s; u)\,ds = x, \tag{2.5}$$

and

$$G(t; v) = \int_t^1 g(s; v)\,ds = y, \tag{2.6}$$

so that

$$\bar{F}(t; u) = 1 - F(t; u), \tag{2.7}$$

and

$$\bar{G}(t; v) = 1 - G(t; v), \tag{2.8}$$

where $\bar{F}(t; u)$ and $\bar{G}(t; v)$ are cumulative probability distributions.

If  the  inverse  of  F  exists  for  every  u,  then  the  score-threshold  ROC  curve  is  (by  implicit 
and  inverse  function  theorems;  see  [Olmstead,  1961]) 

$$y = r(x; u, v), \tag{2.9}$$

where

$$r(x; u, v) = G(F^{-1}(x; u); v). \tag{2.10}$$

Equivalently, $y = r(x; w)$ and $r(x; w) = G \circ F^{-1}(x; w)$, where $w$ concatenates $u$ and $v$
(i.e., $w = [u_1\ u_2\ \ldots\ v_1\ v_2\ \ldots]$).

The proof is in Appendix A-1.
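
A worked instance of Equation (2.10) may help to fix ideas. The sketch below assumes fixed parameters and reads $F$ and $G$ in Equations (2.5) and (2.6) as integrals from $t$ to 1. The densities are chosen only because they have closed forms: a beta(1,2) non-target density $f(s) = 2(1-s)$ and a beta(2,1) target density $g(s) = 2s$. For this pair, $r(x) = 2\sqrt{x} - x$ and the AUC value is 5/6. The function names are hypothetical.

```python
import math


def false_alarm(t):
    """F(t): integral from t to 1 of the non-target density f(s) = 2 (1 - s)."""
    return (1.0 - t) ** 2


def correct_detection(t):
    """G(t): integral from t to 1 of the target density g(s) = 2 s."""
    return 1.0 - t ** 2


def threshold_for(x):
    """Inverse of F: the threshold t that yields false alarm probability x."""
    return 1.0 - math.sqrt(x)


def roc(x):
    """r(x) = G(F^{-1}(x)), which equals 2 sqrt(x) - x for this pair of densities."""
    return correct_detection(threshold_for(x))


def auc(n=10000):
    """Midpoint-rule approximation of Equation (2.4)."""
    h = 1.0 / n
    return sum(roc((k + 0.5) * h) for k in range(n)) * h


print(roc(0.25), auc())
```

A false alarm probability of 0.25 corresponds to a threshold of 0.5, which gives a correct detection probability of 0.75.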

Note that if $u$ and $v$ are fixed, they may be removed in the above formulas (e.g., for fixed
$u$, $f(s) = f(s; u)$); however, retaining $u$ and $v$ is important in later ROC curve density
development where the parameters are not fixed. The parameters $u$ and $v$ (or $w$)
characterize the target and non-target densities of score; the Bayesian approach does not
require the assumption that such parameters are stochastic (see [Gregory, 2005] and
[MacKay, 2003]), but it is acceptable to handle them as random variables (see
[Schervish, 1995]). However, it is common practice to simply refer to $u$ and $v$ (or $w$) as
parameters (see [Schmitt, 1969] and [Kass and Raftery, 1995]) or "random parameters"
(see [Robert, 2001]). Here the term "parameters" is used.

The AUC value is the area under the ROC curve; an ideal AUC value is
one (see Equation (2.4)). A large AUC value (an AUC value near one) is due to
sufficient  separation  between  the  target  and  non-target  densities,  rather  than  whether  or 
not  the  score  values  are  appropriate  estimates  of  the  probability  of  a  target.  An  analysis 
of  AUC  value  applicability  in  evaluating  pattern  recognition  systems  is  given  by  Alsing 
[Alsing,  2000],  and  additional  analysis  specific  to  AUC  value  applicability  is  provided  by 
Bradley  [Bradley,  1997].  Note  that  the  AUC  value  is  a  number,  but  the  ROC  curve  is  a 
function.  Thus,  the  ROC  curve  is  a  performance  metric  that  generates  one  AUC  value, 
but  a  given  AUC  value  may  be  generated  by  many  different  ROC  curves.  If  most  of  the 
target  density  is  greater  than  some  score  and  if  most  of  the  non-target  density  is  less  than 
this  score,  then  the  AUC  is  close  to  one.  In  this  situation,  the  ROC  curve  does  not 
indicate  whether  or  not  the  scores  are  appropriate  estimates  of  the  probability  that  the 
observed  image  is  a  target,  but  the  CEG  curve  and  the  RSD  value  metric  provide  this 
indication.  For  target  detection  system  evaluation,  the  score-threshold  ROC  curve  plots 
the  probability  of  false  alarm  and  probability  of  correct  detection  values  achieved  by 
varying  a  score  threshold.  However,  this  ROC  curve  does  not  indicate  the  threshold  that 
is  required  to  obtain  a  particular  probability  of  false  alarm  and  probability  of  detection. 
For  some  applications,  it  is  of  interest  to  examine  only  particular  regions  of  the  ROC 
curve;  for  example,  in  cases  where  a  false  alarm  probability  greater  than  a  certain  value  is 
not  relevant. 
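
The statement that one AUC value may be generated by many different ROC curves can be made concrete with a small worked example. The two curves below are illustrative choices, not curves taken from the data in this research; both are valid ROC curves that increase from (0,0) to (1,1).

```latex
\begin{align*}
r_1(x) &= \sqrt{x}, & \mathrm{AUC}(r_1) &= \int_0^1 \sqrt{x}\,dx = \tfrac{2}{3}, \\
r_2(x) &= 2x - x^2, & \mathrm{AUC}(r_2) &= \int_0^1 (2x - x^2)\,dx = 1 - \tfrac{1}{3} = \tfrac{2}{3}, \\
r_1(0.04) &= 0.2, & r_2(0.04) &= 0.0784.
\end{align*}
```

The two curves share an AUC value of 2/3 yet differ by more than a factor of two in correct detection probability at a false alarm probability of 0.04, which is why a particular region of the ROC curve can matter more than the summary value.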

Correct  detection  probability  is  used  here  to  refer  to  the  probability  of  correctly  declaring 
a  target  to  be  a  target.  False  alarm  probability  is  used  here  to  refer  to  the  probability  of 
incorrectly  declaring  a  non-target  to  be  a  target.  The  terms  referred  to  here  as  correct 
detection  probability  and  false  alarm  probability  also  have  other  designations.  The  use  of 
the term correct detection probability here can be replaced by "detection probability" or
"true positive probability". Similarly, "false alarm probability" may be replaced by "false
positive probability" (see [Hill et al., 2003]). In medical research, "sensitivity" and
"specificity" are often used instead: correct detection probability as used here
corresponds to sensitivity, and false alarm probability corresponds to one minus
specificity. The use of correct detection probability here reinforces its usage in
"correctly" declaring a target to be a target.

Many  radar  applications  [Hall  et  al.,  1991]  focus  on  low  false  alarm  probabilities,  e.g., 
probabilities on the order of $10^{-14}$ to $10^{-2}$ may be appropriate [Raemer, 1997]. In such
applications,  estimating  the  uncertainty  of  the  full  ROC  curve  may  seem  to  be  of  limited 
practical interest. However, the success of these applications depends on detection system
performance.  Chapter  1  discussed  the  practical  importance  of  both  low  and  high  false 
alarm  probability  in  specific  examples,  and  interest  in  the  full  range  of  false  alarm 
probabilities is consistent with recent target detection focused research (e.g.,
[Zelnio et al., 2005]). As an additional example, consider an unmanned aerial vehicle
(UAV),  such  as  the  Global  Hawk  Unmanned  Aerial  Reconnaissance  System.  Global 
Hawk  flies  at  an  altitude  of  65,000  feet,  and  has  two  synthetic  aperture  radar  modes:  wide 
area  search  mode  (1.0  meter  resolution)  and  spot  image  mode  (0.3  meter  resolution) 
[Curiel,  2005].  The  wide  area  search  mode  can  cover  a  wider  area  in  a  fixed  amount  of 
time  than  the  spot  mode  (40,000  square  miles  versus  3,000  square  miles  in  24  hours 
[Humphlett,  2004]),  but  the  wide  area  search  mode  has  lower  resolution 
[Humphlett,  2004]  [Curiel,  2005].  Thus,  Global  Hawk  may  declare  objects  to  be  targets 
of  interest  in  the  wide  area  search  mode  with  high  false  alarm  probability  permitted,  and 
it  then  may  use  the  declarations  to  subsequently  examine  the  objects  more  closely  in  spot 
mode.  Note  that  even  in  spot  mode,  a  high  false  alarm  probability  may  be  acceptable  if 
the  outcome  of  a  target  declaration  results  in  a  closer  examination  by  a  lower  flying 
air-based  or  ground-based  detection  system.  Finally,  note  that  even  for  radar  systems 
with very low false alarm probability requirements, accurate performance at higher false
alarm probabilities may be important for monitoring proper system function
[Hall et al., 1991]. The methods developed in Chapters 3 through 5 are applicable to the
full range of false alarm probability.


2.3 CEG curves and RSD values


The  CEG  curve  describes  the  accuracy  (or  relevance)  of  the  target/non-target  score 
values,  that  is,  the  curve  describes  whether  the  target/non-target  score  values  are 
appropriate  estimates  of  the  actual  probability  of  observing  a  target.  In  contrast,  the  ROC 
curve  describes  how  well  the  target  and  non-target  scores  are  separated 
[Wise  et  al.,  2004].  Recall  that  a  SUT  outputs  both  target  and  non-target  scores,  and  if 
the  scores  are  accurate,  then  the  probability  of  target  given  score  equals  the  assigned 
score;  that  is,  if  an  ideal  SUT  generates  100  scores  of  0.6,  then  60  of  these  scores  are 
targets  and  40  are  non-targets.  Here,  "ideal"  refers  to  an  SUT  that  generates  scores 
(estimates  of  probability  of  observing  a  target)  which  always  equal  the  true  probability  of 
observing  a  target  given  the  score. 


The RSD value is defined as

$$\mathrm{RSD} = \sqrt{\int_0^1 \left(P(T|s) - s\right)^2 p(s)\,ds}, \tag{2.11}$$

where, using Bayes' rule,

$$P(T|s) = \frac{g(s|T)P(T)}{g(s|T)P(T) + f(s|N)P(N)}, \tag{2.12}$$

and $s$ is a scalar random variable between zero and one, $s \in [0,1]$, $P(T|s)$ is the probability
of target event given score, $g(s|T)$ is the probability density of score given target event,
$f(s|N)$ is the probability density of score given non-target event, $p(s)$ is the prior
probability density of the score (without regard to target or non-target), $P(T)$ is the prior
probability of target event, and $P(N)$ is the prior probability of non-target event
($P(N) = 1 - P(T)$). The CEG curve is defined as a plot of $P(T|s)$ versus score. Similar to
the relation of AUC to ROC, whereas RSD is a value, $P(T|s)$ is a function, and the curve
that it forms as score varies between zero and one is the CEG curve as shown in Figure 2.2.
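
The CEG curve and RSD value can be sketched numerically as follows. The sketch reads Equation (2.11) as a root-mean-squared difference weighted by the density of score, and it assumes equal priors, $P(T) = P(N) = 0.5$, with $p(s) = g(s|T)P(T) + f(s|N)P(N)$. The two density pairs are illustrative choices with closed forms, and the function names are hypothetical.

```python
import math


def posterior_target(s, g, f, p_target=0.5):
    """P(T|s) from Bayes' rule, as in Equation (2.12)."""
    numerator = g(s) * p_target
    return numerator / (numerator + f(s) * (1.0 - p_target))


def rsd(g, f, p_target=0.5, n=20000):
    """Midpoint-rule approximation of the RSD value over scores in [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        p_s = g(s) * p_target + f(s) * (1.0 - p_target)  # overall density of score
        total += (posterior_target(s, g, f, p_target) - s) ** 2 * p_s * h
    return math.sqrt(total)


# beta(2,1) targets and beta(1,2) non-targets: P(T|s) = s, an ideal CEG curve
ideal = rsd(lambda s: 2.0 * s, lambda s: 2.0 * (1.0 - s))
# beta(3,1) targets and beta(1,3) non-targets: better separated, but P(T|s) differs from s
sharp = rsd(lambda s: 3.0 * s ** 2, lambda s: 3.0 * (1.0 - s) ** 2)
print(ideal, sharp)
```

The second pair of densities has the larger AUC value (0.95 versus 5/6) but the larger RSD value, which illustrates that discrimination and confidence accuracy are distinct properties.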

Note  that  many  distinct  target  and  non-target  densities  result  in  ROC  curves  that  are  close 
to  an  ideal  AUC  value  of  1.  For  example,  choose  any  target  beta  density  mean  and 
non-target beta density mean. If the target and non-target density standard deviations are
sufficiently small and if the target density mean is greater than the non-target density
mean, the AUC value
approaches one. For the RSD value, only more specific special cases of target and
non-target densities approach the ideal RSD value of zero. These special cases include:
(a) the target density approaches an impulse function (i.e., a Dirac delta function density
or distribution) at a score of 1 and the non-target density approaches an impulse function
at a score of 0; (b) the target and non-target densities both approach impulse functions at
a score of 0.5; and (c) the ratio of $g(s|T)P(T)$ to $f(s|N)P(N)$ equals $s/(1-s)$ for all
scores, so that $P(T|s) = s$ in Equation (2.12).

Figure  2.3  illustrates  the  process  that  forms  a  CEG  curve.  The  lower  two  plots  compare 
the  RSD  value  described  by  Equation  (2.11)  with  an  unweighted  RSD,  which  does  not 
depend  on  overall  density  of  score.  The  weighted  RSD  value  used  here  is  generally 
preferable (see [Parker et al., 2005c]), because scores that occur infrequently do not
increase  the  RSD  value  in  the  weighted  method.  Figure  2.4  shows  similar  plots,  but  the 
target  and  non-target  densities  in  this  figure  generate  a  more  ideal  CEG  curve  and  a  lower 
RSD  value. 

Figure 2.2 The CEG curve. The CEG curve describes the relevance of scores produced
by a SUT. For example, if an ideal SUT produces 100 scores at values near
0.75, 75 of the scores are targets and 25 are non-targets. The RSD value
summarizes the CEG curve metric and is the root-mean-squared difference
of the probability of target given score and score weighted by the density of
score. The ideal CEG curve is the dotted 45 degree line; an actual CEG curve
is shown by the solid line. At its tails, the density of score may approach
zero, yet the deviation of P(T|s) from ideal at these tails may be significant.
Therefore the incorporation of the density of score as a weight is important.

Figure 2.3 Target and non-target densities, CEG curves, and RSD values. As shown in
the equations on the plot, a RSD value can be weighted or unweighted. The
weighted RSD value is affected by the overall densities of score. The top left
plot shows a target density (solid line) and non-target density (dashed line).
The top right plot shows the CEG curve as the probability P(T|s) of a target
versus score. The bottom two plots show the quantities that are integrated to
obtain unweighted or weighted RSD value. In an ideal SUT, P(T|s) follows
the 45 degree line shown in the top right figure.

Figure 2.4 As in Figure 2.3, but for different target and non-target densities. Here
the unweighted RSD value and the (weighted) RSD value are approximately
equal because the regions where P(T|s) deviates greatly from score (for example,
scores between 0.01 and 0.3) also have high overall score density.

2.4 Relation of performance metrics to SUT evaluation

The objective here is improved evaluation of SUT performance, and in particular an
improved ability to describe uncertainty in performance. However, first consider the
case where the scores that a SUT outputs for a population set of target and non-target
scores are known. In this case, the exact ROC curve and exact CEG curve may be
calculated, as described in Sections 2.2 and 2.3. The exact ROC curve presents the full set of
possible  correct  detection  and  false  alarm  probabilities,  and  this  set  indicates  the 
capability  of  a  SUT  to  differentiate  between  target  and  non-target  scores.  The 
score-threshold  ROC  curve  that  is  the  focus  here  (rather  than  a  likelihood-ratio-threshold 
ROC  curve)  provides  the  additional  indication,  through  curve  shape,  of  whether  or  not 
the  SUT  produces  appropriate  output.  For  an  ideal  SUT,  an  increase  in  score  makes  it 
increasingly  likely  that  a  target  is  observed.  A  score-threshold  ROC  curve  reveals  this 
result;  however,  a  likelihood-ratio-threshold  ROC  curve  assumes,  but  does  not  indicate, 
this  behavior.  A  likelihood-ratio-threshold  ROC  curve  is  always  convex;  a 
score-threshold  ROC  curve  is  only  convex  when  an  increase  in  score  increases  the 
probability  of  observing  a  target  for  all  scores.  The  exact  CEG  curve  describes  whether 
or  not  the  scores  provided  by  a  SUT  are  relevant;  that  is,  whether  or  not  the  scores  that  an 
SUT  generates  are  representative  of  the  actual  probability  of  target  given  score.  Further, 
the combined examination of the ROC curve and CEG curve characteristics of a SUT
provides robust tools for comparing one SUT with another. The related summary AUC
and RSD values also provide useful tools for comparison; however, the curves themselves
enable  particular  probability  of  false  alarm  regions  (in  the  case  of  the  ROC  curve)  and 
particular  score  regions  (in  the  case  of  the  CEG  curve)  to  be  isolated  and  analyzed. 

A  key  motivation  for  this  research  follows  from  the  fact  that  in  practice,  there  is  only  a 
finite,  and  often  small,  set  of  score  samples  available  to  a  target  detection  system.  There 
are  methods  to  estimate  the  ROC  curve  and  CEG  curve  from  such  sets;  however, 
understanding  the  uncertainty  in  the  estimates  may  be  more  important,  particularly  for 
low  numbers  of  score  samples,  than  estimating  the  most  likely  ROC  and  CEG  curves 
(such  maximum-likelihood  estimates  are  inherently  inaccurate  for  low  numbers  of 
samples; see the discussion in Section 3.1).

This research focuses on improved methods of estimating uncertainty in ROC curves,
and  then  extends  the  development  to  CEG  curves.  As  discussed  in  the  literature  review 
of  Section  2.7,  current  methods  of  ROC  curve  uncertainty  estimation  make  unacceptable 
assumptions  or  are  only  appropriate  when  sample  size  is  very  large,  and  thus  existing 
methods  are  not  suitable  to  extend  to  the  CEG  curve. 

The  ROC  curve  uncertainty  estimation  methods  developed  here  provide  results  that  can 
be  compared  with  results  in  the  existing  literature  and  that  can  then  be  extended  to  the 
CEG  curve  uncertainty  estimation  problem.  The  techniques  developed  here  are 
unprecedented  in  ROC  curve  uncertainty  estimation  (see  the  literature  review  of  Section 
2.7  and  related  quantitative  ROC  curve  confidence  interval  comparisons  of  Chapter  5). 
Further, the ROC curve confidence interval framework makes flexible assumptions; even
when results from other methods appear quantitatively comparable, those other
methods generally have unacceptable weaknesses (see Chapters 2 and 5). Note that if
only  a  "best  estimate"  of  the  ROC  curve  is  required,  there  are  suitable  alternatives  to  the 
method  developed  here  (e.g.  maximum  likelihood),  particularly  when  the  prior 
probability  densities  are  diffuse.  While  the  ROC  curve  and  CEG  curve  are  estimated  by 
the  method  developed  here,  obtaining  these  curves  is  not  the  primary  motivation.  The 
method  developed  here  focuses  on  uncertainty  estimation,  and  the  primary  description 
for  such  uncertainty  estimation  here  (and  in  the  literature)  is  confidence  intervals. 
Confidence  intervals  are  important  because  for  the  low  numbers  of  samples  that  are 
typical  for  target  detection  applications,  any  best  (e.g.,  maximum  likelihood)  estimate  of 
the  ROC  curve  may  not  be  close  to  the  actual  curve.  Thus  confidence  intervals  are  of 
practical  interest  because  they  provide  a  description  of  the  range  of  possible  values  for  a 
ROC  curve  if  large  (approaching  infinite)  sets  of  samples  were  actually  tested. 
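
The sensitivity of a best estimate to sample size can be illustrated with a small simulation. This sketch is a sampling experiment for illustration only and is not the Bayesian method developed here. It assumes beta(2,1) target scores and beta(1,2) non-target scores (for which the exact AUC value is 5/6), estimates the AUC as the fraction of correctly ordered pairs, and uses hypothetical function names and a fixed random seed.

```python
import random
import statistics


def pairwise_auc(target_values, nontarget_values):
    """Fraction of (target, non-target) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for a in target_values:
        for b in nontarget_values:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(target_values) * len(nontarget_values))


def auc_spread(n, trials=400, seed=0):
    """Mean and standard deviation of the AUC estimate over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        targets = [rng.betavariate(2.0, 1.0) for _ in range(n)]
        nontargets = [rng.betavariate(1.0, 2.0) for _ in range(n)]
        estimates.append(pairwise_auc(targets, nontargets))
    return statistics.mean(estimates), statistics.stdev(estimates)


mean_5, sd_5 = auc_spread(5)
mean_50, sd_50 = auc_spread(50)
print(mean_5, sd_5, mean_50, sd_50)
```

With five samples of each class the individual estimates scatter much more widely around the exact value than with fifty, so a single estimate from a small sample set may be far from the actual curve.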

The  beta  probability  density,  while  possessing  many  desirable  qualities  for  the  methods 
developed,  is  only  an  example.  It  is  the  density  that  has  maximum  entropy  among  all 
densities that are non-zero on a fixed interval, subject to specific constraints (see
[Gokhale, 1975]) that may be related to mean and variance. However, the analytical
expressions  developed  in  Chapter  3  are  general  and  may  be  applied  to  alternative  density 
models. 

2.5  Bayesian  probability  densities 

The  methods  developed  here  use  a  fully  Bayesian  framework  to  develop  probability 
densities  for  ROC  curves  and  other  target  detection  performance  metrics.  A  Bayesian 
framework incorporates input samples (such as target and non-target samples), a model
(such  as  assuming  that  the  samples  are  modeled  with  a  Gaussian  density),  model 
parameters  (such  as  mean  and  standard  deviation),  and  prior  density  assumptions  (for 
example,  assuming  uniform  prior  probabilities  of  means  from  zero  to  one  and  standard 
deviations  from  zero  to  two).  Then,  the  Bayesian  framework  combines  such  inputs  and 
assumptions  and  produces  a  posterior  probability  density  of  an  output  of  interest,  such  as 
the  ROC  curve  here.  Note  that  the  posterior  probability  density  may  be  updated  if  more 
input  samples  are  available,  but  that  this  density  is  the  actual,  complete  solution  for  the 
available  samples,  model,  and  priors  (see  [MacKay,  2003]  and  [Carlin  and  Louis,  2000]). 
In  developing  the  posterior  probability  density  (which  the  Bayesian  framework  makes 
possible),  the  observed  data  samples  are  handled  as  fixed  known  input  observations.  In 
alternative  (frequentist-based)  approaches,  there  is  an  upfront  focus  on  describing  the 
randomness  of  the  data  samples  (e.g.,  using  probability  statements  and  confidence 
intervals),  thus  making  estimates  about  what  samples  might  have  been  produced  if  more 
samples  were  available.  These  estimates  are  then  applied  to  make  follow-up  statements 
about  the  result  of  interest  (the  ROC  curve  and  CEG  curve  in  the  case  of  this  research). 

In  contrast,  in  a  Bayesian  framework  it  is  the  evaluated  model  parameters  that  are 
handled  as  unknown  parameters  (see  discussion  in  Section  2.2  and  [Bolstad,  2004]). 
Neither  of  the  two  methods  ignores  uncertainty;  both  frequentist-based  and  Bayesian 
methods  make  attempts  to  quantify  uncertainty.  However,  a  benefit  of  the  Bayesian 
framework is that it permits the progressive development of a full, complete, posterior
probability density for the result of interest (e.g., development of the posterior probability
densities  for  the  ROC  curve  and  CEG  curve  in  the  case  of  this  research)  prior  to  the 
development  of  further  descriptions  such  as  confidence  intervals,  median  estimates,  etc. 
This developed posterior probability density fully describes the uncertainty of the result
of interest based on the available observed data samples, the model, and prior
knowledge. Gregory [Gregory, 2005] provides a detailed discussion and further
comparison of frequentist-based and Bayesian approaches. A similar framework was
developed in the early 1990s for neural network applications (see [MacKay, 1992a,
1992b] and [Bishop, 1995]); however, it has not previously been applied to target
detection  performance  metrics.  The  densities  developed  using  the  framework  are 
characterized  here  by  descriptive  statistics,  such  as  median  estimates,  confidence 
intervals  for  ROC  curves,  and  also  by  statistics  that  characterize  the  accuracy  of  the 
confidence  intervals.  Descriptive  statistics  may  be  contrasted  with  inferential  statistics  in 
that  they  simplify  but  do  not  attempt  to  extend  beyond  the  immediate  data  (see 
[Huntsberger,  1961]  and  [Trochim,  2005]).  Thus  confidence  bands  are  descriptive 
statistics  used  to  summarize  the  developed  probability  densities;  the  bands  do  not  extend 
the  data  provided  by  the  densities.  The  density  generation  and  characterization  process  is 
also  applied  to  CEG  curves,  and  it  may  be  applied  to  other  metrics. 
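
In the notation of Theorem 2.1, the generic form of this process can be sketched as follows. The symbols $d_T$ and $d_N$ (the observed target and non-target samples) are introduced here only for illustration, and the factored likelihood assumes that the target and non-target samples are independent; the specific densities used in this research are developed in Chapter 3.

```latex
% Posterior density of the parameters w = [u v] given the observed samples
p(w \mid d_T, d_N) =
  \frac{p(d_T \mid v)\, p(d_N \mid u)\, p(w)}
       {\int p(d_T \mid v')\, p(d_N \mid u')\, p(w')\, dw'}

% Cumulative distribution of the ROC curve value at a fixed false alarm probability x
P\big(r(x; w) \le y \mid d_T, d_N\big) =
  \int \mathbf{1}\big[\, r(x; w) \le y \,\big]\, p(w \mid d_T, d_N)\, dw
```

Quantiles of the second expression at each $x$ (for example, the 5th, 50th, and 95th percentiles) give a median estimate and a confidence band for the ROC curve.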

The  framework  requires  density  models  for  target  and  non-target  detection  system  output 
and  prior  densities  for  model  parameters.  The  Bayesian  approach  incorporates  all  that  is 
known  or  assumed  about  the  data  and  density  models.  For  a  given  set  of  target  samples 
and  non-target  samples,  assumed  sample  density  models,  and  assumed  prior  densities, 
the  Bayesian  formalism  permits  the  development  of  a  ROC  curve  density.  Other 
descriptive  statistics,  such  as  the  ROC  curve  confidence  intervals,  may  then  be  developed, 
if  desired,  from  this  probability  density.  Other  methods  focus  up  front  on  descriptive 
statistics  (e.g.,  the  mean  and  standard  deviation  of  the  target  and  non-target  samples); 
such  methods  force  premature  simplification  of  the  data;  and  either  do  not  account  for  the 
model  assumptions  and  priors  densities  or  can  only  account  for  them  in  an  ad  hoc 
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manner.  A  Bayesian  framework,  by  marginalizing  over  all  possible  models,  provides  a 
more  robust  estimate  for  a  single  set  of  data  than  other  estimation  methods.  Methods 
other  than  Bayesian  may  perform  well  for  large  numbers  of  samples,  but  are  less 
competitive  for  low  numbers  of  samples.  A  ROC  curve  estimated  by  a 
maximum-likelihood  approach  is  more  accurate  as  the  number  of  samples  increases  (see 
general discussion by [Robert, 2001, pp. 16]), but cannot be relied upon for low numbers
of  samples.  Non-Bayesian  approaches  can  have  superior  performance  if  the  Bayesian 
framework  incorporates  inappropriate  model  selection  or  prior  density  selection. 

The  Bayesian  approach  possesses  two  major  strengths.  First,  it  naturally  and  fully 
incorporates  all  possible  model  parameter  values  by  marginalization  (i.e.,  weighted 
averaging  over  all  possibilities).  The  Bayesian  approach  avoids  descriptive  statistics 
until  the  parameters  that  are  not  of  direct  interest  are  integrated  out  and,  thus,  fully 
accounted  for.  In  contrast,  a  maximum-likelihood  approach  attempts  to  find  the  “best” 
parameters  (e.g.,  leading  to  a  single  best  ROC  curve).  The  maximum-likelihood  based 
approach  must  then  make  additional  assumptions  (perhaps  normal-based)  to  describe 
uncertainty.  Bayesian  approaches  are  more  tolerant;  the  focus  is  not  on  finding  a  true 
single  answer  (see  [Morgan,  1968,  pp.  109]),  but  instead  on  describing  the  range  of  all 
possible  answers  in  the  form  of  a  probability  density,  which  is  then  more  easily 
transitioned  to  other  descriptive  uncertainty  statistics.  Second,  the  Bayesian  approach 
naturally  incorporates  the  use  of  prior  densities;  that  is,  it  permits  the  incorporation  of 
subjective  probability  estimates  into  its  framework,  which  is  particularly  critical  when 
sample  size  is  small  (see  [Good,  1965,  pp.  ix]). 

Bayes  estimators  that  perform  point  estimation,  rather  than  the  broader  uncertainty 
estimation that is the focus of this research, are well known. Bayes estimators can be
fully consistent with traditional means of estimation, such as minimum mean square error
(MMSE) and maximum a posteriori (MAP) estimation (see [Scharf, 1991]). Robert
[Robert, 2001] states that a Bayesian approach is consistent with three tests for
optimality from a non-Bayesian perspective: minimaxity, admissibility, and equivariance.
Minimaxity typically considers the worst case scenario, but in contrast to frequentist-based
approaches,  a  Bayesian  approach  prevents  unwarranted  reliance  on  a  worst  case  scenario 
that  has  little  chance  of  occurring  (see  [Robert,  2001,  pp.  67],  [Leonard  and  Hsu,  1999, 
pp.  146],  [Schervish,  1995,  pp.  167],  and  [Duda  et  al.,  2001,  pp.  28]).  Admissibility 
focuses  on  whether  or  not  there  exists  a  better  decision  rule  (see  [Ferguson,  1967,  pp.  54] 
and  [Lehmann  and  Casella,  1998,  pp.  323])  than  the  one  selected.  Equivariance  relates 
to  whether  or  not  an  estimate  is  invariant  under  linear  transformation  (see  [Lehmann, 
1998,  pp.  161,  245]).  Robert  [Robert,  2001]  shows  that  Bayesian  estimators  are  a 
specific  and  preferred  class  of  admissible  estimators  (see  also  [Schervish,  1995]). 

For  further  discussion  on  the  advantages  of  Bayesian-based  approaches  over  more 
traditional methods, see [Good, 1965], [Schmitt, 1969], [Lindley, 1972],
[Antelman, 1997], [Leonard and Hsu, 1999], [Robert, 2001], and [Woodworth, 2004].

2.6  Performance  metric  densities  and  confidence  bounds 

Figure  2.5  extends  the  relationships  indicated  in  Figure  1.2  from  simply  identifying  the 
performance  metrics  to  formulation  of  probability  densities  of  performance  metric  curves 
and  values.  It  indicates  three  types  of  inputs:  target  and  non-target  samples,  model 
specification,  and  sampling  protocol. 

As  will  be  discussed  in  detail  in  Chapter  3,  a  reasonable  model  specification,  if  the 
sample  scores  are  between  zero  and  one,  is  a  beta  density.  The  beta  density  is  specified 
by  two  parameters,  mean  and  standard  deviation.  The  model  specification  also  includes 
prior  assumptions  for  the  parameters;  for  example,  prior  assumptions  may  be  uniform 
prior  densities  for  the  mean  and  standard  deviation  over  their  allowed  domains  (the 
admissible  set,  defined  in  Chapter  3,  specifies  the  allowed  domains).  Another  example 
model  is  a  truncated  Gaussian  density  with  uniform  prior  mean  and  variance  (rather  than 
uniform prior mean and standard deviation). The sampling protocol is also selected, but
results which are not sensitive to this selection (and that approach an analytical solution;
see Chapter 3) are obtained provided that a fine enough spacing in target and non-target
parameter density is used. Monte Carlo methods may also be employed to generate
points that sample the target and non-target density parameter values (see also Chapter 3).

Figure 2.5 Uncertainty estimation process. Data (such as a set of 30 target and 30
non-target scores), Model Specification (such as beta probability densities for
target and non-target and uniform prior densities for their means and standard
deviations), and Sampling Protocol (such as uniform density of points
from the prior densities), are inputs to a Bayesian Process. Outputs are probability
densities for receiver operating characteristic (ROC) and confidence
error generation (CEG) curves. These densities are characterized by plots
that involve descriptive statistics, including histograms of area under receiver
operating characteristic (AUC) and root square deviation (RSD) values from
the ideal CEG curve, and also median ROC and CEG curves and corresponding
curves that bound 90% of the probability density.

The outputs of the Bayesian process (such a process accounts for all input data and prior
densities, and integrates out free parameters through marginalization techniques)
indicated  in  Figure  2.5  include  performance  metric  densities  (for  example,  ROC  and 
CEG  curve  densities).  The  developed  densities  can  be  considered  actual  posterior 
probability  densities  (see  [Carlin  and  Louis,  2000,  pp.  35-36]  for  a  discussion  of  actual 
probability  statements)  for  the  input  samples  (which  are  assumed  independent  and 
identically  distributed  for  the  research  reported  here),  assumed  density  model,  and  prior 
densities  of  the  model  parameters.  Although  they  are  actual  probability  statements  based 
on  available  samples,  the  developed  probability  densities  are  expected  to  change  for  more 
samples  or  different  sets  of  samples.  From  a  Bayesian  standpoint,  posterior  probabilities 
are  subjective  and  "quantify  degrees  of  beliefs"  (see  [Mackay,  2003,  pp.  26,  50]),  so  the 
developed  posterior  probability  densities  do  not  necessarily  encompass  truth  if  the  model 
or  priors  are  incorrect.  Alternatively,  if  the  selected  model  or  priors  are  considered 
estimates,  then  the  posterior  probability  densities  may  be  considered  estimates.  Here, 
since  the  focus  is  on  consistency  with  recent  Bayesian  literature,  the  term  "posterior 
probability  densities"  rather  than  "posterior  probability  density  estimates"  is  used.  Note 
that  Chapter  3  describes  the  performance  metric  density  generation  method,  which  is 
fully  Bayesian  in  that  it  accounts  for  all  assumptions  and  data  and  integrates  out  free 
parameters  through  marginalization.  After  performance  metric  densities  are  produced, 
probability  density  characterization  produces  descriptive  statistics  for  the  ROC  curve, 
CEG  curve,  AUC  value,  and  RSD  value,  as  described  and  verified  in  Chapter  4.  The  four 
figures  that  form  the  rightmost  column  of  Figure  2.5  show  such  statistics. 
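
The process summarized in Figure 2.5 can be illustrated with a minimal sketch. It is not the method of Chapter 3; it only shows the general idea of weighting every point of a (mean, standard deviation) parameter grid by its posterior probability and carrying those weights through to a density for the AUC value. The grid resolution, the upper limit on the standard deviation, the restriction to unimodal beta densities (used here in place of the admissible set, which is defined in Chapter 3), the score binning, the function names, and the sample scores are all assumptions made for illustration.

```python
import math

def beta_shape(mean, std):
    """Map (mean, std) to beta shape parameters (a, b), or None if excluded."""
    var = std * std
    if var >= mean * (1.0 - mean):
        return None                      # no beta density has this mean and std
    nu = mean * (1.0 - mean) / var - 1.0
    a, b = mean * nu, (1.0 - mean) * nu
    # Assumption: keep only unimodal densities so the binning below is stable.
    return (a, b) if a > 1.0 and b > 1.0 else None

def log_likelihood(samples, a, b):
    """Log-likelihood of i.i.d. scores in (0, 1) under a beta(a, b) density."""
    norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) + norm
               for x in samples)

def parameter_posterior(samples, n_grid=10, max_std=0.3):
    """Posterior weights on a uniform (mean, std) grid under a uniform prior."""
    shapes, logw = [], []
    for i in range(1, n_grid + 1):
        for j in range(1, n_grid + 1):
            shape = beta_shape(i / (n_grid + 1), max_std * j / (n_grid + 1))
            if shape is not None:
                shapes.append(shape)
                logw.append(log_likelihood(samples, *shape))
    top = max(logw)                      # subtract the maximum to avoid underflow
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return shapes, [v / total for v in w]

def bin_masses(a, b, n_bins=100):
    """Probability mass of beta(a, b) in n_bins equal score bins (midpoint rule)."""
    mids = [(k + 0.5) / n_bins for k in range(n_bins)]
    dens = [math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))
            for x in mids]
    total = sum(dens)
    return [d / total for d in dens]

def auc(mass_t, mass_n):
    """P(target score > non-target score), counting ties within a bin as 1/2."""
    below, total = 0.0, 0.0
    for pt, pn in zip(mass_t, mass_n):
        total += pt * (below + 0.5 * pn)
        below += pn
    return total

def auc_posterior(target, nontarget):
    """Sorted (AUC value, posterior weight) pairs after marginalizing the parameters."""
    shapes_t, w_t = parameter_posterior(target)
    shapes_n, w_n = parameter_posterior(nontarget)
    masses_t = [bin_masses(*s) for s in shapes_t]
    masses_n = [bin_masses(*s) for s in shapes_n]
    return sorted((auc(mt, mn), wt * wn)
                  for mt, wt in zip(masses_t, w_t)
                  for mn, wn in zip(masses_n, w_n))

def weighted_quantile(pairs, q):
    """Smallest AUC value whose cumulative posterior weight reaches q."""
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= q:
            return value
    return pairs[-1][0]

# Hypothetical scores, strictly between zero and one.
target = [0.62, 0.71, 0.80, 0.55, 0.90, 0.67, 0.75, 0.83]
nontarget = [0.20, 0.35, 0.28, 0.41, 0.15, 0.50, 0.33, 0.25]
pairs = auc_posterior(target, nontarget)
low, median, high = (weighted_quantile(pairs, q) for q in (0.05, 0.50, 0.95))
print(f"AUC median {median:.3f}, 90% interval [{low:.3f}, {high:.3f}]")
```

The median and the 5% and 95% quantiles are descriptive statistics read off the weighted AUC values only after the parameters have been averaged out, which mirrors the ordering of steps described above. A finer grid or Monte Carlo sampling of the parameters could be substituted for the coarse grid used here.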
2.7 Literature review


Next,  a  literature  review  on  methods  for  ROC  curve  estimation  is  presented;  then  a 
review  on  related  confidence  interval/confidence  band  methods  is  provided.  Existing 
approaches have unacceptable weaknesses (e.g., they are effective only for large sample
sizes, are restricted to a particular ROC curve shape, or make other unacceptable
assumptions). The inadequacies of methods in the literature are identified here so that the
benefits  of  the  full  Bayesian  framework  that  are  described  in  Chapters  3  and  4  can  be 
better  appreciated;  the  literature  review  provided  here  is  not  necessary  to  understand  the 
method  developed  in  Chapters  3  and  4.  Later,  Chapter  5  provides  quantitative 
comparison  of  methods  in  the  literature  to  the  method  that  is  developed  here.  Also,  the 
CEG  curve  literature  is  reviewed;  however,  existing  CEG  curve  literature  does  not 
provide  adequate  means  of  uncertainty  estimation.  The  Metz  [Metz  et  al.,  1998]  method 
is  examined  as  a  primary  example.  Then,  other  methods  of  ROC  curve  estimation  and 
ROC  curve  confidence  interval  estimation  are  examined. 

2.7.1  Metz  method.  The  Metz  method,  based  on  binormal  ROC  curve  theory,  is 
implemented  in  a  software  package  called  ROCKIT;  ROCKIT  is  perhaps  the  most  widely 
accepted  ROC  curve  confidence  interval  software  available  today  (see  [Eng,  2005]). 
Binormal ROC curve theory assumes that the target and non-target variables (referred to
as  diseased  or  non-diseased  in  the  medical  literature)  are  either  normal  or  can  be  made 
normal  after  some  unknown  transformation.  Binormal  ROC  curve  development  requires 
that,  rather  than  plotting  the  ROC  curve  along  correct  detection  probability  and  false 
alarm  probability  axes  that  are  both  uniform  between  zero  and  one,  the  axes  use  a  linear 
scaling along normal deviate values, and this scaling is therefore non-uniform between
zero and one (see [Dorfman and Alf Jr., 1968, 1969], [Swets and Pickett, 1982], and
[McNeil and Hanley, 1984]). Once the ROC curve is estimated as a straight line in
normal  deviate  space,  the  ROC  curve  is  then  transformed  into  standard  axes  that  are 
uniform  between  zero  and  one.  Generally  the  curve,  after  being  transformed  into  the 
standard axes, has a convex appearance (although, as detailed later, the curve can have a
“hook”  that  is  especially  apparent  for  small  numbers  of  samples). 
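
The mapping between the two sets of axes can be sketched in a few lines. The example below is illustrative only and is not the ROCKIT implementation: it assumes the usual binormal parameterization in which the line in normal deviate space has intercept a and slope b (both names are assumptions here), maps that line back to axes that are uniform between zero and one, and reports the false alarm probabilities at which the curve falls below the chance diagonal, which is the "hook".

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def binormal_tpf(fpf, a, b):
    """Correct detection probability at false alarm probability fpf.

    In normal deviate space the binormal ROC curve is the straight line
    z_tpf = a + b * z_fpf; this maps that line back to the unit square.
    """
    z_fpf = STD_NORMAL.inv_cdf(fpf)
    return STD_NORMAL.cdf(a + b * z_fpf)

def binormal_auc(a, b):
    """Closed-form area under the binormal ROC curve."""
    return STD_NORMAL.cdf(a / (1.0 + b * b) ** 0.5)

def hook_region(a, b, n=999):
    """False alarm probabilities (on a grid) where the curve is below the diagonal."""
    grid = [(k + 1) / (n + 1) for k in range(n)]
    return [p for p in grid if binormal_tpf(p, a, b) < p]

# With slope b = 1 there is no hook; with b != 1 the line in normal deviate
# space must cross the diagonal line z_tpf = z_fpf, so a hook appears.
print(len(hook_region(1.0, 1.0)), len(hook_region(0.5, 0.4)))
```

Whenever the slope differs from one, the straight line in normal deviate space eventually crosses the diagonal, so the transformed curve dips below chance near one corner of the plot; how visible this is depends on the fitted intercept and slope.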

Historically,  the  binormal  approach  is  the  most  common  in  the  literature  for  rating  scale 
data  [Hanley,  1999].  Rating  scale  data  are  broken  down  into  a  number  of  distinct 
categories  (typically  five)  in  contrast  to  data  described  on  a  continuous  scale.  With  five 
categories,  five  ROC  points  are  plotted  on  the  normal  deviate  plot  described  above. 

Upon  conversion  back  to  a  scale  that  is  uniform  from  0  to  1  for  both  false  alarm 
probability  and  correct  detection  probability,  the  line  becomes  the  ROC  curve.  Note  that 
because  of  assumptions  due  to  plotting  on  the  normal  deviate  axis,  it  is  inappropriate  to 
fit  a  least  squares  line  to  find  the  slope  and  intercept  in  the  normal  deviate  space  that  best 
represents  the  ROC  curve.  Instead,  a  maximum  likelihood  method  is  used.  Dorfman 
[Dorfman  and  Alf  Jr.,  1968,  1969]  proposes  a  widely  accepted  method  that  estimates  the 
ROC  curve  in  such  a  manner.  For  an  alternative  maximum  likelihood  estimation 
development,  see  [Metz,  1984]. 

Metz [Metz et al., 1998] develops an algorithm that extends the binormal approach to a
large  number  of  distinct  categories,  and  therefore  permits  application  of  the  binormal 
approach  to  a  continuous  scale. 

Metz [Metz et al., 1998] (and Swets [Swets and Pickett, 1982]) alleviates the need to
estimate the target and non-target distributions directly. Metz found that the binormal
approach  provides  satisfactory  ROC  fits  to  data  generated  in  a  “very  broad  variety  of 
situations”. 

Here we consider what “broad variety of situations” means in a medical context. In the
medical decision community, it is assumed that when a known marker that indicates a
disease is measured (from a blood test, for example), the likelihood of disease in all
cases increases (or decreases) monotonically as the marker level increases. For a target
detection system under test, this is not necessarily the case (while the monotonic
property is desirable for a system under test, one of the primary reasons for estimating the
entire ROC curve is to determine whether it holds, not to assume that it holds). Therefore, an
assumed binormal ROC curve fit has weaknesses for target detection system evaluation.

Many applications that rely on binormal theory are interested primarily in the
accuracy of the area under the ROC curve (AUC) rather than in the curve itself. The binormal
ROC curve provides a good estimate of the AUC value, but is recognized as being of less utility
when attempting to estimate an unknown ROC shape. Hajian-Tilaki
[Hajian-Tilaki et al., 1997] concludes that a binormal model is a robust method for
determining AUC. However, they state that other indices, such as the estimated
true-positive fraction at a specific false-positive fraction, might be more sensitive to
departures from binormality.

The  binormal  ROC  has  recognized  limitations,  particularly  for  small  numbers  of 
samples.  In  general  for  many  medical  diagnostic  scenarios,  there  is  a  large  amount  of 
sample  data.  So,  requiring  large  sample  sizes  as  a  precondition  may  be  reasonable  for 
the medical community. The originator of binormal ROC maximum-likelihood theory,
Dorfman [Dorfman et al., 1997], states that the binormal ROC is not robust in small
sample  sets  (Metz  was  a  coauthor  of  the  1997  paper).  Further,  a  study  by  Obuchowski 
[Obuchowski  and  Lieber,  1998]  is  unsupportive  of  the  usefulness  of  a  binormal  ROC 
curve  model  (and  other  alternative  ROC  curve  models)  in  estimating  accurate  confidence 
intervals  in  studies  with  small  sample  sizes. 

Because  of  recognized  inaccuracies  in  the  binormal  ROC  when  the  true  unknown  ROC  is 
assumed  to  be  convex  (the  transformation  from  a  linear  plot  in  normal  deviate  space 
results in a ‘hook’ that can be particularly prevalent for small numbers of samples), Metz
and Dorfman [Dorfman et al., 1997] [Metz and Pan, 1999] advocate the development of a
correction factor. Thus, it is recognized that even for the general assumptions for which
binormal ROC theory is applicable, there are limitations. The desire to remove “the
hook” has its origin in the assumption that the likelihood of observing a target increases
monotonically as the target score increases, i.e., the assumption that the appropriate
model for a ROC is a convex shape. This assumption is not appropriate for ROC curves
that evaluate the utility of a target detection system.

ROCKIT, which is used later for comparisons with the
method developed here, takes target and non-target sample inputs (either from user
created files or from keyboard input). The user must specify whether such sample inputs
are handled on a continuous scale or on a ratings scale, and the user must specify whether
high or low score values refer to targets. Then, ROCKIT produces an output file that
contains  estimates  for  points  on  the  ROC  curve  (generally  false  alarm  probabilities  of 
0.05,  0.01,  0.02, ...,  0.10,  0.20,  0.90,  0.95),  AUC  value,  estimates  for  the  binormal 
parameters  that  are  used  to  form  the  ROC  curve,  95%  confidence  intervals  for  the  ROC 
curve,  uncertainty  estimates  for  the  AUC  value,  and  uncertainty  estimates  for  the 
binormal  parameters. 

Chapter  5  provides  a  full  comparison  of  the  method  developed  here  with  the  Metz 
approach described above. The weaknesses of the Metz method compared with the
method developed here are even more apparent in the comparison provided by Chapter 5.

2.7.2  Other  existing  methods.  Figure  2.6  diagrams  methods  in  the  literature  which 
estimate  ROC  curves.  The  oval  regions  identify  fundamental  techniques  that  estimate 
ROC  curves  and  compute  ROC  curve  uncertainty,  and  the  unshaded  rectangular  regions 
identify  authors,  years,  and  approaches.  The  shaded  rectangular  regions  identify 
available  ROC  curve-related  software,  where  the  arrows  to  the  software  indicate  the 
approaches  they  employ.  Practical  use  of  a  SUT  that  is  described  by  a  ROC  curve 
requires  the  selection  of  a  threshold.  Unless  the  underlying  non-target  density  is 
deterministic,  there  is  uncertainty  in  which  false  alarm  probability  corresponds  with  a 
particular  threshold.  Greenhouse  [Greenhouse  and  Mantel,  1950]  forms  bounds  to 
describe this type of uncertainty. Linnet [Linnet, 1987] extends this evaluation to ROC
curves, and Schafer [Schafer, 1994] builds on work by Linnet and Wieand
[Wieand et al., 1989]. A disadvantage of the Greenhouse bounds is that such uncertainty
in false alarm probability is assumed to follow a normal distribution.

Figure 2.6 Relevant ROC curve literature and software. The figure shows an overview
of relationships of ROC curve estimation and confidence interval development
available in the literature. Underlying processes (not necessarily specific
to ROC curves) are typically leveraged to estimate the form of ROC
curves. The oval regions identify fundamental ROC curve estimation techniques
(e.g., binomial, binormal, kernel, empirical). The estimation techniques
permit the calculation of confidence intervals. The lines indicate relations
among methods. The relations are only between the line origination
points and the end points indicated by arrows. Several software packages,
indicated by shaded boxes, apply particular ROC curve estimation processes
and/or ROC curve uncertainty estimation processes (e.g., ROCKIT, MedCalc).

Hilgers  [Hilgers,  1991]  details  a  method  that  generates  confidence  bounds  for  ROC 
curves based on binomial proportions. He applies order statistics to obtain confidence
intervals given an interval range of interest (e.g., 90%) for each of a set of samples. For
example, if there are five target samples, he estimates the lowest-valued sample for a
two-sided 90% confidence interval to be between 0.02 and 0.53 of the overall cumulative
distribution function (CDF) for target. He then estimates the second-lowest-valued
sample  for  the  same  two-sided  90%  confidence  interval  to  be  between  0.07  and  0.70  of 
the  overall  CDF  for  target.  Finally,  he  combines  the  estimates  to  obtain  confidence 
intervals  for  probability  of  correct  detection  and  probability  of  false  alarm.  A  constraint 
on  the  Hilgers  approach  is  that  the  confidence  intervals  are  "pointwise"  and  describe  the 
range  for  a  single  point  on  the  ROC  curve.  Hilgers  extends  these  bounds  to  a  confidence 
band  by  using  a  progression  of  rectangles  based  on  the  pointwise  confidence  intervals. 
However,  Schafer  [Schafer,  1994]  shows  that  this  procedure  leads  to  an  estimated  bound 
larger  than  90%.  An  advantage  of  the  Hilgers  approach  is  that  it  generates 
‘distribution-free’  confidence  bounds,  unlike  many  approaches  (most  of  which  require 
some  assumptions  such  as  binormal  target/non-target  densities).  Examples  considered  in 
Section  5.4  are  consistent  with  Schafer  in  that  the  bounds  are  wide  compared  with  the 
approach  developed  here. 

Non-parametric  approaches  develop  ROC  curves  analytically  and  do  not  assume  a  form 
for  the  underlying  distributions.  Zou  [Zou  et  al.,  1997]  provides  an  example  which  uses 
a  Parzen  window-like  data  transformation,  referred  to  as  kernel  density  estimation 
[Silverman,  1986].  Kernel  density  estimation  enables  ROC  curve  construction  using  a 
smoothed  histogram.  Zou  leverages  Silverman  to  describe  the  kernel  density  estimation 
of target or non-target density as

\[ \hat{f}(x) = \frac{1}{mh} \sum_{i=1}^{m} k\left( \frac{x - X_i}{h} \right) \]    (2.13)

where k is the kernel density, m is the number of samples, X_i is the i-th sample in M, and
h > 0 is the kernel width. Zou indicates that estimating f, in effect, places at each X_i in
the sample an enclosed curve with area 1/m, where each curve has a shape described by
the  function  k  and  scaled  by  the  width  h.  The  curves  are  then  added  with  the  goal  of 
obtaining  a  smooth  but  accurate  histogram.  With  kernel  density  estimation,  the  function 
chosen  for  k  is  somewhat  arbitrary,  as  is  the  selection  of  function  width.  Improved 
methods  for  width  selection  are  desirable,  but  the  optimization  process  is  subjective.  For 
example,  Hall  [Hall  and  Hyndman,  2003]  explores  methods  for  improving  bandwidth 
selection, and Hall [Hall et al., 2004] considers a method that makes width-dependent
assumptions and generates results based on kernel estimation. The results of Hall show
potential for significant degradation as false alarm probabilities approach 0 or 1 (these
degradations are quantitatively compared with the method developed here in Chapter 5).
Sorribas [Sorribas et al., 2002] introduces an S-distribution that is related to kernel density
estimation  methods,  and  Campbell  [Campbell  and  Ratnaparkhi,  1993]  estimates  ROC 
curves  based  on  the  Lomax  distribution;  neither  approach  introduces  new  methods  of 
confidence  interval  development. 
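
A kernel-smoothed ROC curve of the general kind described above can be sketched as follows. This is a minimal illustration rather than the procedure of Zou or Silverman: it assumes a Gaussian kernel, takes the kernel widths as user-supplied values instead of selecting them from the data, and uses function names that are hypothetical. Each sample contributes a curve of area 1/m, so the smoothed probability of exceeding a threshold is the average of the per-sample kernel tail areas.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Gaussian kernel k (an assumed choice)

def kde_pdf(x, samples, h):
    """Kernel density estimate (1 / (m h)) * sum_i k((x - X_i) / h)."""
    m = len(samples)
    return sum(STD_NORMAL.pdf((x - s) / h) for s in samples) / (m * h)

def kde_exceedance(threshold, samples, h):
    """Smoothed P(score > threshold): the area of kde_pdf above the threshold."""
    m = len(samples)
    return sum(1.0 - STD_NORMAL.cdf((threshold - s) / h) for s in samples) / m

def kde_roc(target, nontarget, h_target, h_nontarget, thresholds):
    """(false alarm probability, correct detection probability) per threshold.

    High scores are taken to indicate targets.
    """
    return [(kde_exceedance(t, nontarget, h_nontarget),
             kde_exceedance(t, target, h_target)) for t in thresholds]

# Hypothetical scores and kernel widths.
roc = kde_roc([0.6, 0.7, 0.9], [0.2, 0.4, 0.5], 0.1, 0.1,
              [0.0, 0.25, 0.5, 0.75, 1.0])
for fpf, tpf in roc:
    print(f"{fpf:.3f} {tpf:.3f}")
```

Changing either kernel width changes the resulting curve, which is one concrete way to see the arbitrariness noted above. A Gaussian kernel also places some probability outside the interval from zero to one, a detail that a careful implementation for bounded scores would need to address.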


In  principle,  the  goal  of  empirical  approaches  is  to  estimate  ROC  curves  without  making 
distribution  assumptions.  Claeskens  [Claeskens  et  al.,  2003]  is  the  most  recent  among 
many  authors  who  consider  empirical  ROC  curve  estimation.  As  is  typical,  Claeskens 
recognizes  the  need  for  a  smooth  ROC  curve  and  he  uses  kernel  smoothing  estimation, 
which  thus  introduces  some  distribution  assumptions.  Claeskens  presents  confidence 
regions  for  ROC  curves  with  definitions  similar  to  those  of  Hilgers  that  involve  the 
regions  of  uncertainty  for  both  correct  detection  probability  and  false  alarm  probability  at 
a  given  threshold.  Claeskens  discusses  other  confidence  interval  descriptions,  but  reverts 
to  a  bootstrap  confidence  interval  estimation  method  when  these  confidence  intervals  are 
calculated. Earlier approaches in the empirical category are considered by Hsieh
[Hsieh  and  Turnbull,  1996],  and  a  local  smoothing  technique  is  investigated  by  Qiu 
[Qiu  and  Le,  2001];  however,  neither  approach  fully  develops  confidence  intervals. 

Ma  [Ma  and  Hall,  1993]  applies  the  Working-Hotelling  hyperbolic  confidence  band  for 
multiple  regression  surfaces  to  ROC  curves;  they  generate  pointwise  confidence  bands  by 
varying  correct  detection  probability  and  mapping  a  band  of  intervals  for  false  alarm 
probability  and  also  simultaneous  confidence  bands  for  the  entire  ROC  curve.  Some 
limitations  of  this  approach  are  that  the  confidence  bands  for  the  entire  ROC  curve 
assume  binormality,  and  the  method  uses  rating  scale  data.  However,  their  approach 
extends  to  multiple  confidence  interval  and  confidence  band  definitions,  and  they 
emphasize  the  need  for  such  definition  flexibility.  Although  Ma  claims  that  the 
Working-Hotelling  approach  extends  beyond  binormal  methods,  confidence  bands  are 
obtained  using  conventional  binormal  assumptions  applied  to  ratings  scale  data.  Further, 
the Working-Hotelling approach applies only when the assumptions made permit the use
of  regression  lines. 

Confidence  intervals  may  be  generated  using  various  resampling  methods,  even  if 
different  methods  develop  the  ROC  curve  estimates.  Examples  are  in 
[Zhou and Qin, 2005], [Platt et al., 2000], [Jensen et al., 2000], [Mossman, 1995],
[Campbell, 1994], [Garber et al., 1994], and [Simpson et al., 1989]. Efron
[Efron  and  Tibshirani,  1993]  details  general  bootstrap  theory  that  is  often  leveraged  in 
ROC  curve  resampling  processes  (see  Mossman  [Mossman,  1995]  and  Jensen 
[Jensen  et  al.,  2000]).  The  confidence  interval  results  are  generally  jagged  in  appearance 
(as  shown  in  Figure  5.3  of  Chapter  5),  and  the  coverage  areas  are  inaccurate  for  low 
numbers  of  samples,  particularly  in  regions  of  low  correct  detection  probability  density. 
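
A resampling interval of this general kind can be sketched as below. The example is a plain percentile bootstrap applied to the empirical ROC curve and is not taken from any of the cited papers; the function names, the number of resamples, and the scores are assumptions for illustration. Target and non-target scores are resampled with replacement, the empirical correct detection probability is recomputed at each false alarm probability of interest, and the interval is read from the sorted resampled values.

```python
import random

def empirical_tpf(target, nontarget, fpf):
    """Correct detection probability of the empirical ROC curve at a given
    false alarm probability (high scores are taken to indicate targets)."""
    ranked = sorted(nontarget, reverse=True)
    k = int(fpf * len(ranked))           # non-target scores allowed above threshold
    threshold = ranked[k] if k < len(ranked) else float("-inf")
    return sum(s > threshold for s in target) / len(target)

def bootstrap_band(target, nontarget, fpf_grid, n_boot=500, level=0.90, seed=0):
    """Pointwise percentile-bootstrap intervals, one per false alarm probability."""
    rng = random.Random(seed)
    draws = {p: [] for p in fpf_grid}
    for _ in range(n_boot):
        t = rng.choices(target, k=len(target))        # resample with replacement
        n = rng.choices(nontarget, k=len(nontarget))
        for p in fpf_grid:
            draws[p].append(empirical_tpf(t, n, p))
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))
    band = {}
    for p, values in draws.items():
        values.sort()
        band[p] = (values[lo_idx], values[hi_idx])
    return band

# Hypothetical scores.
target = [0.90, 0.80, 0.75, 0.60, 0.55, 0.40]
nontarget = [0.50, 0.45, 0.30, 0.25, 0.20, 0.10]
for p, (lo, hi) in bootstrap_band(target, nontarget, [0.1, 0.3, 0.5]).items():
    print(p, lo, hi)
```

With only a handful of samples, each resampled correct detection probability can take only a few distinct values, so the interval endpoints move in steps; this is consistent with the jagged appearance noted above. The intervals are pointwise and are not a simultaneous band for the whole curve.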

Lloyd  [Lloyd,  2002]  implements  bootstrap  confidence  methods  by  evaluating  ROC  curve 
definitions  many  times  in  a  Monte  Carlo  approach.  He  obtains  confidence  intervals  using 
a  maximum  likelihood  approach  to  estimate  the  ROC  curves  parametrically  and 
non-parametrically. He does not verify the coverage accuracy of the bootstrap method,
and  he  cautions  that  bias  may  be  a  significant  disadvantage  for  small  samples. 

Once target and non-target data are obtained, Tilbury [Tilbury et al., 2000] asks for every
point  on  the  ROC  curve,  “If  this  point  represents  the  true  Hit  Rate  and  False  Alarm  Rate 
of  the  population,  what  would  be  the  probability  of  getting  the  sample  actually  obtained.” 
He  analyzes  one  point  (false  alarm  probability  and  correct  detection  probability  at  one 
selected  threshold)  on  the  ROC  curve,  then  he  considers  a  combined  approach  for  four 
selected  thresholds.  For  just  four  points,  he  obtains  a  solution  based  on  an 
eight-dimensional  hyperboundary,  where  increasing  the  number  of  initial  points  on  the 
ROC  curve  increases  the  dimensions  needed.  He  suggests  estimating  ROC  curve  density 
by  selecting  a  point  on  the  ROC  curve  and  finding  the  likelihood  that  given  samples 
(assuming  a  threshold)  are  generated  if  this  point  is  from  the  underlying  densities 
(consistent with a Hilgers-like binomial-based approach). Tilbury requires an expansion of
dimensionality  based  on  the  number  of  samples. 
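
The question posed for a single threshold can be made concrete with a binomial likelihood. The sketch below is an assumed illustration of that idea and is not Tilbury's formulation: it treats the number of targets and the number of non-targets that exceed one threshold as independent binomial counts, and it evaluates, for each candidate (false alarm rate, hit rate) point on a grid, the probability of the counts actually observed. All names and counts are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def point_likelihood(hit_rate, false_alarm_rate,
                     hits, n_target, false_alarms, n_nontarget):
    """Probability of the observed counts at one threshold if the candidate
    (false_alarm_rate, hit_rate) point were the true operating point."""
    return (binomial_pmf(hits, n_target, hit_rate)
            * binomial_pmf(false_alarms, n_nontarget, false_alarm_rate))

def likelihood_grid(hits, n_target, false_alarms, n_nontarget, steps=20):
    """Likelihood of every candidate point on a (steps + 1) by (steps + 1) grid."""
    grid = [k / steps for k in range(steps + 1)]
    return {(far, hr): point_likelihood(hr, far, hits, n_target,
                                        false_alarms, n_nontarget)
            for far in grid for hr in grid}

# Hypothetical counts: 8 of 10 targets and 2 of 10 non-targets exceed the threshold.
surface = likelihood_grid(hits=8, n_target=10, false_alarms=2, n_nontarget=10, steps=10)
best = max(surface, key=surface.get)
print(best)
```

One threshold already requires a two-dimensional surface. Treating several thresholds jointly adds two dimensions per threshold, which is the growth in dimensionality described above.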

Although  Tilbury’s  approach  remains  tractable  if  a  few  selected  thresholds  are  permitted 
(through  grouping  of  data),  Macskassy  [Macskassy  and  Provost,  2004]  declares  Tilbury’s 
method  not  tractable  for  more  than  ten  points.  Tilbury  provides  updates  to  his  work 
[Tilbury, 2002, 2003a, 2003b] that emphasize the importance of Bayesian statistics in
ROC curve analysis, and he uses Bayes’ rule in considering the descriptions of the 2000
paper. However, his approach remains a binomial-based alternative to Hilgers’
[Hilgers, 1991] approach. Tilbury [Tilbury et al., 2000] claims verification of results for
uncertainty  of  correct  detection  probability  and  false  alarm  probability,  but  these  are  (at 
best)  simply  verified  coverages  for  single  thresholds  considered  independently  (even 
here,  he  does  not  report  actual  accuracies,  but  provides  tables  of  distributed  data,  and  he 
does  not  compare  results  with  other  research).  Tilbury’s  method  in  theory  permits 
incorporation  of  prior  densities  of  false  alarm  and  correct  detection  probability,  but  not 
prior  target  and  non-target  densities. 
In summary, Tilbury provides an alternate description of the work of Hilgers
[Hilgers,  1991]  by  leveraging  binomial  assumptions  and  forming  contour  regions  for 
particular  thresholds  rather  than  the  rectangular  regions  of  Hilgers.  His  method  does  not 
permit  the  incorporation  of  different  target  and  non-target  density  models  or  target  and 
non-target  prior  parameter  densities,  and  he  does  not  demonstrate  the  practical 
development  of  a  ROC  curve  confidence  band  jointly  across  the  entire  curve  (such  as 
those  that  are  tested  for  coverage  accuracy  in  the  research  reported  here).  His  method 
produces  confidence  bands  for  particular  thresholds  similar  to  Hilgers  but  with  different 
shape.  Tilbury  attempts  analytically  to  show  how  such  regions  could  be  combined,  but 
he  avoids  verification  (consistent  with  Macskassy’s  tractability  concerns),  except  for 
correct  detection  probability  and  false  alarm  probability  uncertainty  regions  at  individual 
threshold  points  (similar  to  Hilgers).  Further,  his  approach  is  based  on  proportions  that 
correspond  with  the  correct  detection  and  false  alarm  probability  models  but  do  not 
correspond  directly  with  "score"  and  "probability  of  target  given  score".  Thus,  Tilbury’s 
ROC  curve  confidence  interval  approach  does  not  extend  to  the  CEG  curve  and  other 
performance  metrics. 

Tosteson  [Tosteson  and  Begg,  1988]  develops  regression  parameters  to  estimate  the 
shape  of  the  ROC  curve  for  a  fixed  number  of  thresholds  (such  as  five  thresholds).  The 
regression  parameters  attempt  to  describe  the  relation  of  covariates  such  as  stage  of 
disease,  age,  and  weight  to  the  estimated  ROC  curve.  Several  related  extensions  develop 
Bayesian-based  approaches  to  more  robustly  account  for  the  regression  parameters  (see 
[Peng  and  Hall,  1996],  [Hellmich  et  al.,  1998],  and  [Zou  and  O’Malley,  2005]).  These 
approaches  assume  a  binormal  ROC  curve  form.  Smith  [Smith  et  al.,  1996]  provides  an 
alternative  to  the  binormal-based  methods,  but  Smith’s  approach  also  makes  curve  shape 
assumptions.  O’Malley  [O’Malley  et  al.,  2001]  provides  an  alternative  to  the  grouped 
data methods but still makes binormal assumptions. Each of these regression-based
approaches has significant limitations compared with the method developed here. The
methods are restricted to an assumed shape of the curve; a shape is not assumed for the
SUT-focused ROC curve estimates developed here. Due to the focus on shape
parameters,  the  methods  are  not  generally  transferable  to  other  performance  metrics  such 
as  the  CEG  curve.  Further,  the  efforts  do  not  consider  confidence  interval  coverage 
accuracy  verification.  Zou  [Zou  and  O’Malley,  2005],  O’Malley  [O’Malley  et  al.,  2001], 
and  Smith  [Smith  et  al.,  1996]  avoid  ROC  curve  confidence  intervals  altogether,  and 
Hellmich  [Hellmich  et  al.,  1998]  and  Peng  [Peng  and  Hall,  1996]  provide  confidence 
intervals  based  on  the  binormal  mean  and  slope  parameters  but  do  not  verify  coverage 
accuracy.  Note  that  the  methods  listed  above  focus  on  alternatives  to  maximum 
likelihood  estimation  for  generally  binormal  based  ROC  curves  rather  than  uncertainty  in 
such  estimates. 

A  number  of  authors  leverage  Bayesian  approaches  in  order  to  combine  ROC  curve 
results  for  meta-analysis  applications;  meta-analysis  focuses  on  pooling  the  results  of 
multiple  diagnostic  tests  (see  [Carlin,  1992],  [Smith  et  al.,  1995],  [Zhou,  1996], 
[Hellmich  et  al.,  1999],  [Rutter  and  Gatsonis,  2001],  and  [Dukic  and  Gatsonis,  2003]). 
Such  approaches  use  Bayesian-based  processes  to  combine  the  ROC  curves  and  AUC 
value  of  each  individual  test  into  a  combined  estimate  of  the  underlying  true  ROC  curve 
and  AUC  value. 

Various  approaches  focus  solely  on  AUC  value  uncertainty  (see  [DeLong  et  al.,  1988], 
[Broemeling,  2004],  [Yousef  et  al.,  2005],  [Agarwal  et  al.,  2005],  and 
[Cortes  and  Mohri,  2005]).  DeLong  [DeLong  et  al.,  1988]  leverages  U-Statistics  to 
provide  an  estimate  of  whether  two  AUC  values  are  statistically  different  from  one 
another;  DeLong  includes  an  evaluation  of  uncertainty  in  making  such  estimates.  Yousef 
focuses  on  AUC  value  standard  deviation  (which  may  exceed  one)  as  a  description  of 
uncertainty.  Yousef’s  approach  has  limitations,  as  the  AUC  values  may  be  skewed  and 
must  be  less  than  one.  Yousef  does  not  have  a  true  verification  process,  only  a 
comparison  with  results  that  are  already  available  through  traditional  bootstrapping 
processes.  Yousef  assumes  that  ROC  curves  have  convex  form.  Agarwal  and  Cortes 
develop approaches that focus on uncertainty in the Mann-Whitney statistic (the
Mann-Whitney  statistic  enables  computation  of  the  AUC  value  without  development  of 
an  entire  ROC  curve),  and  both  methods  are  limited  to  large  numbers  of  samples.  In 
comparison,  the  method  developed  here  focuses  on  ROC  curve  uncertainty,  although  the 
results  are  also  successfully  applied  to  AUC  value  uncertainty  and  then  extended  to  CEG 
curve  uncertainty.  Broemeling  proposes  a  Bayesian  based  approach  to  AUC  value 
estimation,  but  his  method  is  only  applicable  for  a  limited,  fixed  number  of  possible 
thresholds  (Broemeling  uses  five  thresholds),  rather  than  the  continuous  set  of  possible 
thresholds  that  the  research  developed  here  makes  possible.  Broemeling  computes  AUC 
value  confidence  intervals  for  two  examples  but  does  not  verify  coverage  accuracy. 
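
The Mann-Whitney route to the AUC value mentioned above can be illustrated with a short sketch. It is not taken from any of the cited papers; the function name and the convention that ties count one half are assumptions made here for illustration. The AUC value is computed as the fraction of (target, non-target) score pairs in which the target score is the larger, so no ROC curve has to be formed.

```python
def auc_mann_whitney(target_scores, nontarget_scores):
    """AUC value as the normalized Mann-Whitney statistic.

    Counts the (target, non-target) pairs in which the target score is
    larger; ties count one half (an assumed, commonly used convention).
    """
    wins = 0.0
    for t in target_scores:
        for n in nontarget_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(target_scores) * len(nontarget_scores))
```

The double loop is quadratic in the number of samples, which is acceptable for the small sample sizes of interest here; a rank-based computation would be preferable for large data sets.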

Dass  [Dass  and  Jain,  2005]  provides  an  approach  to  ROC  confidence  bands  but  with  a 
focus  on  correlated  samples.  The  Dass  approach  is  restricted  to  correlated  samples 
(rather  than  independent  samples),  is  limited  to  large  numbers  of  samples,  and  does  not 
verify  coverage  accuracy. 

Overviews  of  ROC  curve  theory  are  given  by  Centor  [Centor,  1991],  Hanley 
[Hanley,  1999],  and  Zweig  [Zweig  and  Campbell,  1993].  Hanley  and  Zweig  provide 
relevant  overviews  in  the  ROC  curve  confidence  interval  area.  More  recently,  Macskassy 
[Macskassy et al., 2005] [Macskassy and Provost, 2004] reviews ROC curve confidence
interval approaches for the machine learning community, and Carsten
[Carsten et al., 2003] evaluates ROC-curve-related software. Bamber [Bamber, 1975],
Lusted [Lusted, 1971], and Swets [Swets and Pickett, 1982] provide historical
background  on  ROC  curve  theory.  Bamber  identifies  the  underlying  purpose  and 
meaning  of  AUC  value.  Lusted  summarizes  the  origins  of  ROC  curve  theory  as  related 
to  signal  detectability.  Swets  and  Pickett  provide  a  widely  recognized  reference  text  on 
ROC  curve  theory.  Green  [Green  and  Swets,  1988]  provides  a  detailed  ROC  theory 
review  in  a  reprint/revision  of  a  text  originally  written  in  1966.  Lusted  discusses  the 
relation of the medical decision making and radar progression in ROC curve
development  [Lusted,  1984]. 

As  mentioned  in  Section  2.2,  in  medical  research  sensitivity  is  typically  used  in  place  of 
correct  detection  probability,  and  one  minus  specificity  replaces  false  alarm  probability. 
Similarly,  "diseased  patients"  often  replaces  "target  data",  and  "healthy  patients"  replaces 
"non-target  data".  The  discussion  here  refers  to  target,  non-target,  probability  of  correct 
detection,  and  probability  of  false  alarm  for  consistency  even  when  the  literature  uses 
different  (but  analogous)  terms. 

Figures 2.7, 2.8, and 2.9 provide an overview of existing ROC curve confidence interval
approaches. A review of the approaches listed in these figures reveals differences in
confidence interval definitions and emphasizes that existing methods lack robustness and
flexibility: the methods identified in the literature typically focus on a subset of the
possible confidence bound definitions and do not extend to other definitions. Confidence
bound definitions are summarized as follows.

Confidence  definition  1:  fixed  threshold.  This  definition  selects  a  particular  threshold, 
develops  an  estimate  for  false  alarm  probability  uncertainty,  and  similarly  develops 
correct  detection  probability  uncertainty.  Approaches  in  the  literature  often  attempt  to 
extend  this  approach.  For  example,  a  rectangular  region  is  created  based  on  the 
uncertainties  in  false  alarm  and  correct  detection  probability.  A  complete  estimate  of 
ROC  curve  uncertainty  is  then  made  by  connecting  the  corners  of  the  boxes  (see  Figure 
5.9).  A  weakness  of  this  ad  hoc  approach  is  that  typically  the  confidence  interval  band  is 
wide  compared  with  other  approaches,  particularly  at  low  sample  sizes.  Examples  are 
considered  by  Hilgers  [Hilgers,  1991]. 
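
A minimal sketch of the fixed-threshold construction follows. It is not Hilgers's method: a Wilson score interval is substituted here for his order-statistic bounds because it has a simple closed form, and the function names, the 1.96 multiplier, and the rule "declare target when score is at or above the threshold" are all assumptions made for illustration. The sketch returns the rectangle formed by separate intervals on false alarm probability and correct detection probability at one threshold.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (stand-in method)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def threshold_rectangle(target_scores, nontarget_scores, threshold, z=1.96):
    """Return ((pfa_lo, pfa_hi), (pd_lo, pd_hi)) at a single threshold."""
    # Assumed decision rule: a score at or above the threshold is called a target.
    detections = sum(s >= threshold for s in target_scores)
    false_alarms = sum(s >= threshold for s in nontarget_scores)
    pd_bounds = wilson_interval(detections, len(target_scores), z)
    pfa_bounds = wilson_interval(false_alarms, len(nontarget_scores), z)
    return pfa_bounds, pd_bounds
```

Because the target and non-target samples are independent, two intervals that each have confidence 1 - alpha give a rectangle with joint confidence of roughly (1 - alpha)^2, which is the origin of the "(1 - alpha)^2 bounds" label attached to Hilgers in Figure 2.7.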

Confidence  definition  2:  uncertainty  in  correct  detection  probability  at  given  false  alarm 
probability.  This  definition  regards  false  alarm  probability  as  the  independent  variable, 
Approach | Confidence Definition | Distribution Assumptions | Confidence Example or Verification | Comments
1987 Linnet (Independ. Var) | Definition 2: Uncertainty in Pd at Pfa | Normal and non-parametric (but symmetric) methods | Normal example. No verification | First ROC recognition of Pfa uncertainty
1991 Hilgers ((1-alpha)^2 bounds) | Definition 1: Fixed Threshold. Also: Definition 4: Full Curve | Binomial. Order statistics | Example. No verification on full curve. Individual bounds verified [Ross, 2003] | Large confidence band area
1993 Ma & Hall (W-H Bands) | Definition 4: Full Curve. Also: Definition 1, 2. | Binormal. Working-Hotelling regression theory | Binormal example. No verification | Emphasize use of multiple confidence definitions
1994 Campbell (Ind. Var Uncertainty) | Definition 2: Uncertainty in Pd at Pfa | Kolmogorov distribution theory | Example. No verification | Confidence bounds made up of same size rectangles

Figure 2.7 ROC literature comparison I. Confidence interval approaches are listed by
author. Correct detection probability is Pd and false alarm probability is Pfa.
The first column lists confidence interval or band definitions. The second
column lists distribution assumptions. The third column indicates whether
confidence interval examples or verified results are provided. The most
promising verified results are compared with the method developed here in
a later section. The fourth column comments on significant attributes of the
corresponding research.
Approach | Confidence Definition | Distribution Assumptions | Confidence Example or Verification | Comments
1994 Schafer (Ind. Var Uncertainty) | Definition 2: Uncertainty in Pd at Pfa | Asymptotic theory | Normal, binormal examples. Verification | Coverages large, unreliable for small samples
1994 Campbell (Bootstrap) | Definition 4: Full Curve. Also: Definition 1, 2. | None (bootstrap resampling) | Example. No verification | Symmetric bands, fixed width displacement
1997 Zou (Logit Transform) | Definition 2: Uncertainty in Pd at Pfa | Kernel logit transformation | Beta mixture model example. No verification |
1998 Metz (Continuous Scale) | Definition 2: Uncertainty in Pd at Pfa | Binormal | Example. No verification | Implemented in ‘Rockit’ software
2000 Platt (Bootstrap Test) | Definition 2: Uncertainty in Pd at Pfa | None (bootstrap resampling) | Beta and normal verification | Verifies Linnet and other approaches

Figure 2.8 ROC literature comparison II. For explanation, see Figure 2.7.
Approach | Confidence Definition | Distribution Assumptions | Example or Verification
2003 Claeskens (Empirical w/CI) | Definition 1: Fixed Threshold | Kernel | Verification only at specific thresholds
2003 Claeskens (Empirical w/CI) | Definition 4: Full Curve | Empirical log-likelihood ratio to estimate curve; bootstrap method for confidence band | No verification
2004 Hall (Kernel) | Definition 2: Uncertainty in Pd at Pfa | Kernel | Verification
2005 Zhou (Bootstrap) | Definition 2 | Adjusted binomial adjustment to bootstrap | Verification

Comments column entries, in order of appearance: Bootstrap bands based on Lloyd (1998);
No confidence interval widths; Confidence interval widths.

Figure 2.9 ROC literature comparison III. For explanation, see Figure 2.7.
and the literature and the method developed here tend to focus on this choice. Thus
uncertainty  in  correct  detection  probability  is  calculated  at  a  given  false  alarm  probability, 
and  confidence  contours  covering  the  entire  ROC  curve  are  developed  by  repeating  for  all 
given  false  alarm  probabilities.  Linnet  [Linnet,  1987]  notes  that  assuming  false  alarm 
probability  is  known  when  it  is,  in  fact,  uncertain  introduces  error  in  correct  detection 
probability.  Examples  are  considered  by  Campbell  [Campbell,  1994],  Linnet 
[Linnet,  1987],  Schafer  [Schafer,  1994],  Zou  [Zou  et  al.,  1997],  Metz  [Metz  et  al.,  1998], 
Platt  [Platt  et  al.,  2000],  and  Zhou  [Zhou  and  Qin,  2005]. 

Confidence  definition  3:  uncertainty  in  false  alarm  probability  at  a  given  correct 
detection  probability.  This  approach  is  similar  to  confidence  Definition  2,  except  that 
correct  detection  probability  is  regarded  as  the  independent  variable.  For  beta  target  and 
non-target  densities,  the  method  developed  here  produces  confidence  bands  by  this 
definition that are similar to the bands of confidence Definition 2. No methods in the
literature are known to focus on this definition.

Confidence  definition  4:  full  curve  confidence  band.  This  band  represents  the 
uncertainty  of  the  entire  ROC  curve.  The  literature  focuses  less  on  this  definition  than  on 
that  of  Definition  2.  Examples  are  considered  by  Ma  [Ma  and  Hall,  1993],  Claeskens 
[Claeskens  et  al.,  2003],  and  Campbell  [Campbell,  1994].  Bands  by  this  method 
typically  have  the  objective  of  enclosing  the  entire  true  ROC  curve  with  a  selected 
percentage  confidence.  If  even  a  small  portion  of  the  ROC  curve  is  outside  of  the  band, 
then  the  entire  band  is  regarded  as  being  in  error. 
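
The all-or-nothing character of the full-curve criterion can be made concrete with a small sketch. The function names and the representation of a band as lower and upper correct detection probabilities on a shared grid of false alarm probabilities are assumptions made for illustration. The first function applies the full-curve test of Definition 4; the second reports the fraction of grid points covered, which is closer in spirit to the pointwise view of Definition 2.

```python
def band_covers_curve(pd_true, pd_lower, pd_upper):
    """Full-curve test: True only if the true curve is inside the band
    at every grid point."""
    return all(lo <= pd <= hi
               for pd, lo, hi in zip(pd_true, pd_lower, pd_upper))


def pointwise_coverage(pd_true, pd_lower, pd_upper):
    """Fraction of grid points at which the band contains the true curve."""
    hits = [lo <= pd <= hi
            for pd, lo, hi in zip(pd_true, pd_lower, pd_upper)]
    return sum(hits) / len(hits)
```

A band that misses the true curve at a single grid point fails the first test even though its pointwise coverage may be high, which is why full-curve bands must generally be wider than pointwise bands at the same stated confidence.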

Confidence  definition  5:  curve  location  based  on  uniform  threshold.  This  confidence 
bound  describes  ROC  curves  for  a  threshold  chosen  uniformly  at  random.  Such  bounds 
are  not  described  in  the  literature  but  are  a  natural  extension  of  the  method  developed 
here.  Figure  4.10  shows  ROC  curve  confidence  bounds  based  on  this  definition  and 
shows higher densities close to the ROC curve extremes. This result is appropriate
because  any  ROC  curve  has  a  correct  detection  probability  of  zero  at  a  false  alarm 
probability  of  zero  and  similarly  a  correct  detection  probability  of  one  at  false  alarm 
probability  of  one. 

Unlike  the  references  to  ROC  curves,  many  CEG  curve  and  RSD  value  approaches  differ 
from  those  developed  here.  Since  the  metrics  differ,  the  methods  of  obtaining  confidence 
intervals  or  variance  for  the  metrics  also  differ.  For  example,  Lombard  [Lombard,  2003] 
details  an  approach  for  estimating  uncertainty  in  on-line  gauges,  O’Connor 
[O’Connor  et  al.,  2001]  describes  the  asymmetry  of  confidence  intervals  related  to 
weather forecasting, and Yaniv [Yaniv and Foster, 1997] analyzes the precision and
accuracy  of  judgmental  estimation.  The  performance  metrics  described  in  the  latter  can 
be  transitioned  to  confidence-error-like  performance  metrics. 

The  scores  from  a  SUT  are  posterior  probability  estimates  as  detailed  by  Bishop 
[Bishop,  1995].  However,  for  the  CEG  curve  the  intent  is  not  to  estimate  posterior 
probability  but  rather  to  estimate  how  well  an  unknown  “black  box”  performs  in 
providing  estimates  of  posterior  probability.  Thus,  the  intent  is  to  provide  confidence 
intervals  for  CEG  curve  and  RSD  values,  which  characterize  score  posterior  probability. 
El-Jaroudi  [El-Jaroudi,  1990],  Lugosi  [Lugosi  and  Pawlak,  1994],  Poggio 
[Poggio  et  al.,  2004],  and  Tomasi  [Tomasi,  2004]  focus  on  estimating  error  in  posterior 
probability.  Existing  research  is  more  relevant  in  formulating  alternative  approaches  for 
determining  confidence  error  than  in  quantifying  confidence  intervals,  variance,  and/or 
the  density  of  confidence  error.  Also,  another  confidence-interval-like  method  involves 
cross-entropy  (see  [Bishop,  1995]),  which  is  a  metric  often  used  in  speech  processing. 

Research  in  the  ATR  community  for  performance  metrics  and  confidence  error  includes 
work  by  Ceritoglu  [Ceritoglu  et  al.,  2003],  DeVore  [DeVore,  2004],  Irvine 
[Irvine et al., 2002], Li [Li et al., 2001], Mossing [Mossing and Ross, 1998], [Ross et al.,
1997, 1998, 1999, 2002], [Ross and Mossing, 1999], and Thorsen
[Thorsen and Oxley, 2004]. These references help identify the relevance and need for
confidence  error  as  a  performance  metric.  Ross  [Ross  and  Minardi,  2004]  develops  the 
rationale for a CEG-curve-based performance metric and identifies the ability of such a
metric  to  provide  information  on  the  performance  of  a  target  recognition  system  that  the 
ROC  curve  is  not  able  to  provide.  Ross  points  out  that  confidence  errors  (to  include 
additional  confidence  measures  of  performance)  are  in  themselves  estimates  and 
emphasizes  that  the  ATR  community  needs  confidence  intervals  for  these  estimates. 

The probability density estimation methods and related techniques that are leveraged in
the chapters that follow to form ROC curve and CEG curve confidence intervals must also
be considered. The methods developed here
apply a Bayesian framework to ROC curve and CEG curve performance metrics. A
similar framework was devised in the early 1990s for neural network applications
[MacKay, 1992a, 1992b]; this framework has not heretofore been comprehensively
applied  to  target  detection  performance  metrics.  Bishop  [Bishop,  1995]  provides  a 
summary  of  MacKay’s  contributions.  A  critical  aspect  of  the  Bayesian  approach  is 
correct  modeling  of  the  prior  parameter  densities.  For  the  beta  density  model  considered 
here,  it  is  shown  that  sampling  uniformly  over  the  domain  of  all  means  and  standard 
deviations  yields  appropriate  results.  Chapter  3  describes  the  analytical  convergence  of 
this  procedure,  which  may  also  be  obtained  using  a  Monte  Carlo  approach.  As  model 
parameters  become  more  complex,  other  Monte  Carlo  methods  and  Bayesian  techniques 
may  be  suitable  alternatives  to  sampling  uniformly  over  parameter  domains.  Clyde 
[Clyde,  1999]  identifies  search  methods  for  posterior  densities;  and  Clyde 
[Clyde  and  George,  2004]  details  advancements  that  make  such  posterior  density 
searches  practical.  Barbieri  [Barbieri  and  Berger,  2004]  suggests  a  robust  posterior 
density  approximation  that  considers  only  parameter  values  which  have  posterior  density 
weights  that  are  50%  of  the  maximum  posterior  weight.  Jordan  [Jordan  et  al.,  1999] 
details  various  computational  methods  for  calculating  posterior  densities.  Hoeting 
[Hoeting  et  al.,  1999],  Raftery  [Raftery  et  al.,  2003],  and  Madigan 
[Madigan and Raftery, 1994] discuss the application of Occam’s razor, which refers to
the  concept  that  whereas  more  complex  models  are  possible,  the  posterior  density 
contribution  of  a  simpler  model  should  generally  outweigh  a  more  complex  model  (other 
constraints  being  equal).  Occam’s  window  reflects  the  concept  that  all  parameter  values 
that  have  less  than  a  selected  percentage  of  the  maximum  weighting  can  be  disregarded 
without  loss  of  accuracy  [Hoeting  et  al.,  1999]. 
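
Occam's window, as described above, amounts to a simple thresholding of posterior weights. The sketch below is an illustration only: the function name, the dictionary representation of the weights, and the default cutoff fraction of 0.05 are assumptions and are not values taken from the cited references.

```python
def occams_window(weights, fraction=0.05):
    """Drop parameter values whose weight is below `fraction` of the
    maximum weight, then renormalize the remaining weights to sum to one.

    `weights` maps a parameter value (any hashable key) to a
    non-negative posterior weight.
    """
    cutoff = fraction * max(weights.values())
    kept = {key: w for key, w in weights.items() if w >= cutoff}
    total = sum(kept.values())
    return {key: w / total for key, w in kept.items()}
```

The discarded weights are small by construction, so the renormalized weights change little while the number of parameter values that must be carried through later computations can shrink considerably.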

For the research presented here, the beta density is appropriate because this density is
non-zero for score values between zero and one, and a single beta density has a simple
unimodal form. The use of the beta density is also justified because it is the
density of maximum entropy that is zero beyond a limited domain, subject to two
constraints that may be related to the density mean and variance. Gokhale
[Gokhale,  1975]  investigates  the  usefulness  of  maximum  entropy  distributions  subject  to 
various  constraints,  and  Kagan  [Kagan  et  al.,  1973]  documents  the  properties  of  the  beta 
density  relative  to  maximum  entropy.  Note  also  that  several  recent  ROC  confidence 
interval  papers  (see  [Platt  et  al.,  2000],  [Hall  et  al.,  2004],  and  [Zhou  and  Qin,  2005])  use 
beta  densities  to  generate  samples. 

2.7.3 Summary of existing research. Each of the ROC curve uncertainty estimation
methods discussed above has weaknesses that the method developed here largely
overcomes.  Some  methods  [Zhou  and  Qin,  2005]  only  provide  acceptable  results  as 
sample  size  becomes  large,  which  is  the  opposite  of  what  is  needed  for  the  target 
detection applications considered here. Other methods are restricted to normal-based
assumptions and cannot be extended to other density forms (see [Ma and Hall, 1993]);
binormal  based  approaches  [Metz  et  al.,  1998]  make  unacceptable  restrictions  on 
functional  forms.  Still  other  methods  (e.g.  [Hilgers,  1991])  produce  confidence  regions 
that  are  too  large  and  therefore  uninformative.  Further,  most  of  the  authors  identified 
here  refrain  from  quantitative  verification  of  results;  the  few  that  do  are  examined  in 
detail  in  Chapter  5.  The  quantitative  comparison  provided  in  Chapter  5  of  the  method 
developed here with existing methods reinforces the above discussion. Recent literature
in the ATR community introduces the basis for the CEG curve and RSD value described
here; however, methods for their confidence interval (or band) uncertainty estimation are
not available, although the need for such methods has been identified (see
[Ross and Minardi, 2004]).

Thus,  a  review  of  the  previous  research  reveals  that  a  new  method  for  performance  metric 
uncertainty  estimation  is  needed.  The  method  developed  and  verified  in  Chapters  3  and  4 
introduces  a  flexible  new  framework  that  can  be  applied  to  ROC  curves  and  CEG  curves, 
and  it  provides  uncertainty  estimates  for  these  curves  (and  for  their  summary  metrics  of 
AUC  value  and  CEG  value). 




3.  Probability  Density  Generation 


This  chapter  develops  methods  that  generate  probability  densities  for  target  detection 
performance  metrics,  such  as  the  ROC  curve.  The  development  process  has  the 
following  rationale.  First,  consider  that  deterministic  performance  metrics  (e.g.,  a  fully 
specified  ROC  curve  with  no  uncertainty)  assume  that  the  target  and  non-target  sample 
densities  of  score  are  known.  Such  exact  target  and  non-target  sample  densities  could  be 
determined  from  the  samples  if  it  were  possible  to  generate  an  infinite  set  of  target  and 
non-target  samples.  From  a  finite  set  of  samples,  it  is  not  possible  to  determine  exactly 
the  target  sample  density  and  the  non-target  sample  density.  Thus,  a  set  of  possible 
densities  for  a  finite  set  of  samples  is  examined,  with  each  density  defined  by  values  of 
one  or  more  parameters  (for  example,  the  parameters  for  a  univariate  Gaussian  density 
consist  of  mean  and  variance).  Next,  using  a  Bayesian  process,  parameter  values  for  the 
target  and  non-target  densities  are  found.  Finally,  the  resulting  densities  of  target  and 
non-target  samples  are  used  to  find  probability  densities  for  the  performance  metrics. 
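
The Bayesian step in this rationale can be sketched on a coarse grid. This is an illustration under stated assumptions, not the procedure developed in the sections that follow: the grid spacing, the function names, the restriction to beta densities with a, b > 1, and the use of a prior that is uniform over the retained (mean, standard deviation) grid points are all choices made here. For one set of scores, the sketch assigns each candidate beta density a posterior weight proportional to the likelihood of the samples.

```python
import math


def shape_from_moments(mu, sigma):
    """Invert the beta mean/variance relations to get (a, b)."""
    nu = mu * (1.0 - mu) / sigma ** 2 - 1.0  # nu = a + b
    return mu * nu, (1.0 - mu) * nu


def beta_log_likelihood(scores, a, b):
    """Log-likelihood of scores in (0, 1) under a beta(a, b) density."""
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum(log_c + (a - 1.0) * math.log(s) + (b - 1.0) * math.log(1.0 - s)
               for s in scores)


def posterior_weights(scores, step=0.02):
    """Posterior weight of each (mu, sigma) grid point, uniform prior."""
    log_like = {}
    n_mu = int(round(1.0 / step))
    n_sigma = int(round(0.3 / step))  # sigma stays below about 0.289 when a, b > 1
    for i in range(1, n_mu):
        mu = i * step
        for j in range(1, n_sigma + 1):
            sigma = j * step
            a, b = shape_from_moments(mu, sigma)
            if a > 1.0 and b > 1.0:  # assumed restriction for this sketch
                log_like[(mu, sigma)] = beta_log_likelihood(scores, a, b)
    peak = max(log_like.values())  # subtract the peak to avoid underflow
    weights = {k: math.exp(v - peak) for k, v in log_like.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

Running this once on the target scores and once on the non-target scores gives two sets of weighted densities; each pairing of a target density with a non-target density then yields one candidate ROC curve with a weight equal to the product of the two posterior weights.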

The  procedure  for  developing  densities  is  applicable  to  any  parametric  density  model  (the 
beta  density  model  is  the  example  emphasized  here).  Once  the  performance  metric 
probability  density  is  generated,  a  variety  of  standard  descriptive  statistics  may  be 
developed,  including  mean,  median,  mode,  confidence  bounds,  etc.  Chapter  4, 
Probability  Density  Characterization  and  Verification,  considers  these  descriptive 
statistics. 

3.1 Target and non-target samples, density models, and ROC curve estimates

Section  2.2  focused  on  deterministic  ROC  curves,  where  the  underlying  target  and 
non-target  densities  are  known.  This  section  focuses  on  the  relation  of  samples  to 
assumed  underlying  target  and  non-target  score  probability  densities  and  on  ROC  curve 
estimates  obtained  from  these  densities.  Figure  3.1  shows  example  target  and  non-target 
densities, a set of samples generated from the target density, and a second set of samples
generated  from  the  non-target  density.  Here  30  target  score  samples  (triangles)  and  30 
non-target  score  samples  (circles)  are  drawn  from  their  respective  specified  underlying 
densities.  For  an  infinite  set  of  such  samples  the  target  and  non-target  densities  are 
known.  In  this  ideal  case  the  associated  performance  metrics  of  ROC  curve,  AUC  value, 
CEG  curve,  and  RSD  value  are  deterministic  and  have  no  uncertainty.  When  only  a  finite 
number  of  samples  are  available,  the  target  and  non-target  densities  for  an  infinite 
number of samples are not known but are desired. Any density that is non-zero at each
of the sample values has some probability of being the density formed by an infinite set
of samples. However, it is appropriate to consider only density functional forms or
models that incorporate additional available information, such as that the density is
continuous and is non-zero only between zero and one.


Beta  densities  are  used  to  implement  the  performance  metric  uncertainty  estimation 
framework  developed  here.  While  this  density  model  is  reasonable,  a  major  advantage  of 
the  framework  developed  here  is  that  it  is  applicable  to  other  models.  The  beta  density  is 
of  interest  because  it  has  zero  magnitude  outside  the  interval  [0,1],  as  assumed  for  the 
target  and  non-target  score  data.  Additionally,  the  beta  density  (see  [Patel  et  al.,  1976] 
and  [Mendenhall  et  al.,  1990])  has  maximum  entropy  among  all  continuous  densities  that 
are  non-zero  only  between  zero  and  one  and  that  meet  two  additional  constraints 
[Kagan  et  al.,  1973]  which  may  be  related  to  mean  and  variance.  The  beta  density  with 
parameters  a,  b  >  0  is 


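$$
f(s) = \begin{cases} C_{a,b}\, s^{a-1} (1-s)^{b-1}, & 0 < s < 1, \\ 0, & \text{elsewhere,} \end{cases} \tag{3.1}
$$

where s is score and the mean and variance of the beta density are related to a and b by

$$
\mu = \frac{a}{a+b} \quad \text{and} \quad \sigma^2 = \frac{ab}{(a+b)^2 (a+b+1)}, \tag{3.2}
$$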


Figure  3.1  Target  and  non-target  samples  and  the  densities  from  which  they  are  drawn. 

A  target  beta  density  (solid  line)  and  a  non-target  beta  density  (dashed  line) 
are  shown;  these  densities  are  typically  estimated  from  samples.  Here  30 
target  score  samples  (triangles)  and  30  non-target  score  samples  (circles)  are 
drawn  from  their  respective  densities. 
and the constant $C_{a,b}$ equals $1 / \int_0^1 s^{a-1} (1-s)^{b-1} \, ds = 1/\mathrm{Beta}(a, b)$.

A  simple  method  for  mapping  a  set  of  sample  scores  to  a  beta  density  is  to  find  the  mean 
and  variance  of  the  scores,  then  use  them  in  Equation  (3.2)  to  obtain  the  a  and  b  values. 
Once  the  scores  are  mapped  to  beta  density  form  (one  density  for  target  samples  and  the 
other  density  for  non-target  samples),  a  ROC  curve  and  corresponding  AUC  value,  as 
well  as  a  CEG  curve  and  a  corresponding  RSD  value,  are  calculated.  Note  that  using 
sample  mean  and  variance  to  estimate  a  beta  density,  where  the  sample  variance  is 
unbiased  in  that  it  is  the  sum  of  squared  deviations  from  the  mean  divided  by  the  number 
of  samples  minus  one,  is  equivalent  to  a  maximum-likelihood  approach  as  sample  size 
increases  (see  [Hahn  and  Shapiro,  1967]). 
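
The moment-matching step just described can be written out directly. Solving Equation (3.2) for the parameters gives a + b = mu(1 - mu)/sigma^2 - 1, so that a = mu(a + b) and b = (1 - mu)(a + b). The function names below are assumptions made for illustration; the variance is the unbiased sample variance, as stated in the text.

```python
import statistics


def beta_from_moments(mean, variance):
    """Return beta parameters (a, b) with the given mean and variance."""
    total = mean * (1.0 - mean) / variance - 1.0  # total = a + b
    if total <= 0.0:
        raise ValueError("variance too large for a beta density with this mean")
    return mean * total, (1.0 - mean) * total


def fit_beta(scores):
    """Map sample scores to (a, b) using the sample mean and the
    unbiased (n - 1 divisor) sample variance."""
    return beta_from_moments(statistics.mean(scores),
                             statistics.variance(scores))
```

For example, a mean of 0.4 and a variance of 0.04 give a + b = 0.24/0.04 - 1 = 5, so a = 2 and b = 3.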

Figure  3.2  compares  ROC  curve  estimates  for  10  sets  of  30,  300,  1000,  and  3000  target 
and  non-target  samples.  To  obtain  such  sets  for  comparison  with  the  true  ROC  curve, 
first  choose  an  underlying  target  density  and  non-target  density.  Then  find  the  ROC 
curve  that  corresponds  with  these  densities  from  Equation  (2.10).  This  ROC  curve, 
computed  numerically,  is  shown  as  the  solid  line  on  each  of  the  four  plots.  From  the 
densities,  randomly  and  independently  draw  30  target  samples  and  30  non-target  samples 
to  obtain  one  set  of  data.  Estimate  the  target  and  non-target  beta  densities  as  the  densities 
with  the  mean  and  unbiased  variance  of  the  target  and  non-target  samples  (mean  and 
variance determine the density parameter vectors u and v of Equation (2.10)), and form a
ROC  curve  from  these  estimates.  Find  the  10  sets  of  ROC  curves  for  the  30  target  and  30 
non-target  samples,  then  repeat  for  10  sets  of  300,  1000,  and  3000  pairs  of  target  and 
non-target  samples.  Note  that  even  for  the  3000  sample  example,  differences  in  the  ROC 
curve  estimates  are  apparent.  Figure  3.3  shows  a  similar  progression,  except  that  here 
the  ROC  curves  are  formed  by  evaluating  the  correct  detection  probability  and  false 
alarm  probability  at  every  score  value  using  only  the  sample  values  and  not  an  assumed 
model.  Figures  3.2  and  3.3  indicate  that  ROC  curve  estimates  for  low  numbers  of 
samples  may  not  be  close  to  the  true  ROC  curve.  The  variance  shown  in  the  plots  in 
Figure 3.2 The ROC curve estimates for various sample sizes, where beta density
estimates generate the ROC curves. Target and non-target beta densities
generate target and non-target samples, and ROC curve estimates are formed
from  beta  densities  that  have  the  mean  and  variance  of  the  samples.  For  the 
top  left  plot,  10  ROC  curves  (dashed  lines)  for  10  sets  of  30  target  and  30 
non-target  samples  are  generated  by  fitting  beta  densities  to  the  samples.  In 
the  other  plots,  similar  sets  of  ROC  curves  for  300,  1000,  and  3000  pairs  of 
target  and  non-target  samples  are  generated.  The  actual  ROC  curve  that  the 
densities  form  for  an  infinite  number  of  samples  is  shown  as  the  solid  line 
on  each  plot.  Variance  is  apparent  in  the  plots,  even  for  3000  target  samples 
and  3000  non-target  samples. 
Figure 3.3 The ROC curve estimates for various sample sizes, where the empirical
samples generate the ROC curves. The four plots are formed using the process
of  Figure  3.2,  except  the  ROC  curves  are  formed  directly  using  the  sample 
values;  a  beta  density  form  is  not  assumed.  The  variance  in  each  of  the 
plots  emphasizes  the  importance  of  ROC  curve  uncertainty  estimation  and 
the  inadvisability  of  focusing  on  one  ROC  curve  estimate. 
these two figures emphasizes the importance of ROC curve uncertainty rather than the
estimated  ROC  curve.  Section  3.2  details  a  fully  Bayesian  process  for  estimating  ROC 
curve  uncertainty. 
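The sample-size progression described for Figure 3.3 can be imitated with a short simulation. The sketch below is illustrative only: the beta shape values (4, 8) and (8, 4), the false alarm level 0.1, the random seed, and the helper names `empirical_roc` and `detection_at` are assumptions made for this example and are not taken from the figures. It forms empirical-threshold ROC curves for 10 sets of samples at each sample size and reports the spread of the detection probability at a fixed false alarm probability.

```python
import bisect
import random

def empirical_roc(nontarget, target):
    """Empirical ROC points, sweeping the threshold over every observed score.

    A score is declared a detection when it is >= the threshold (an assumed
    convention for this sketch).
    """
    nt, tg = sorted(nontarget), sorted(target)
    pts = [(0.0, 0.0)]
    for t in sorted(set(nt) | set(tg), reverse=True):
        x = (len(nt) - bisect.bisect_left(nt, t)) / len(nt)  # false alarm probability
        y = (len(tg) - bisect.bisect_left(tg, t)) / len(tg)  # correct detection probability
        pts.append((x, y))
    return pts

def detection_at(pts, x_max):
    """Largest detection probability reached with false alarm probability <= x_max."""
    return max(y for x, y in pts if x <= x_max)

rng = random.Random(0)          # fixed seed so the run is repeatable
spread = {}
for n in (30, 300, 3000):
    ys = []
    for _ in range(10):         # 10 sets of n non-target and n target samples
        nontarget = [rng.betavariate(4, 8) for _ in range(n)]  # hypothetical non-target density
        target = [rng.betavariate(8, 4) for _ in range(n)]     # hypothetical target density
        ys.append(detection_at(empirical_roc(nontarget, target), 0.1))
    spread[n] = max(ys) - min(ys)
    print(n, round(spread[n], 3))
```

The printed spread is the range of the 10 detection estimates at each sample size; it is a crude stand-in for the visual scatter of the curves in the figures.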

Unimodal  beta  densities  and  score-threshold  ROC  curves  are  the  assumed  model  and 
performance  metric  for  much  of  the  research  discussed  here;  however,  the  beta  density  is 
used  for  illustration.  The  framework  developed  in  the  next  section  (with  beta  densities) 
may  be  applied  to  other  density  models  and  to  likelihood  threshold  ROC  curves.  For 
example,  multi-modal  beta  mixture  models  and  related  empirical-threshold  and 
likelihood-threshold  ROC  curves  are  shown  in  Figures  3.4  and  3.5. 

3.2 Bayesian posterior densities of parameters and weighted ROC curves

The two left plots of Figure 3.6 show the collection of pairs of means and standard
deviations for beta densities that are zero at scores of 0 and 1. Values of standard
deviation outside each “rounded triangle” do not exist for these densities. Values of
standard deviation inside each “rounded triangle” form the admissible set. For this
beta density model, the admissible set $A$ consists of $(\mu, \sigma)$ pairs such that

$$\sigma < \begin{cases} \mu\sqrt{\dfrac{1-\mu}{1+\mu}}, & 0 < \mu < 0.5, \\[2ex] (1-\mu)\sqrt{\dfrac{\mu}{2-\mu}}, & 0.5 \le \mu < 1. \end{cases}$$
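As a rough illustration of the admissible set, the sketch below converts a mean and standard deviation to beta shape parameters and tests admissibility. It assumes that "zero at scores of 0 and 1" means both beta shape parameters exceed 1; the bound in `sigma_max` is derived here from that assumption, and the function names are hypothetical.

```python
import math

def beta_shape_from_moments(mu, sigma):
    """Convert a (mean, standard deviation) pair to beta shape parameters (a, b)."""
    nu = mu * (1.0 - mu) / sigma**2 - 1.0   # nu = a + b
    return mu * nu, (1.0 - mu) * nu

def sigma_max(mu):
    """Largest standard deviation for which both shape parameters stay above 1."""
    if mu <= 0.5:
        return mu * math.sqrt((1.0 - mu) / (1.0 + mu))   # a > 1 is the binding limit
    return (1.0 - mu) * math.sqrt(mu / (2.0 - mu))       # b > 1 is the binding limit

def is_admissible(mu, sigma):
    """True when (mu, sigma) lies strictly inside the assumed admissible set."""
    return 0.0 < mu < 1.0 and 0.0 < sigma < sigma_max(mu)
```

For example, `is_admissible(0.82, 0.055)` accepts one of the mean and standard deviation pairs quoted in the Figure 3.4 caption, while a pair such as (0.5, 0.3) falls outside the set.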

Admissible  sets  may  also  be  defined  for  other  density  models,  including  density  models 
that  are  not  restricted  to  two  parameters.  The  target  and  non-target  densities  shown  in  the 
right  plot  of  this  figure  map  to  unique  locations  on  the  standard  deviation  versus  mean 
graphs shown at the left. Applying Bayes’ rule in a process consistent with that
developed by [MacKay, 1992a, 1992b] for the neural network community, but not
heretofore applied to target detection performance metrics, the densities of model
Figure 3.4 Target (solid) and non-target density (dashed) examples with a beta mixture
model. In the upper graph two separate sums of 30 beta densities form the
target and non-target densities. Similarly, a sum of two beta densities forms
each  density  in  the  lower  graph.  (The  target  density  has  0.82,  0.055  and 
0.7,  0.045  for  the  mean  and  standard  deviation  of  the  two  beta  densities,  and 
the  ratio  of  their  amplitudes  is  0.45.  The  corresponding  five  values  for  the 
non-target  density  are  0.6,  0.084,  0.45,  0.071,  and  0.45). 



Figure  3.5  Relation  of  the  true  likelihood-threshold  ROC  curve  (dot-dash  line),  the  true 
score-threshold  ROC  curve  (solid  line),  and  the  empirical-threshold  ROC 
curve (dashed line). The true ROC curves assume knowledge of the underlying
densities (shown in the bottom plot of Figure 3.4). For the true
likelihood-threshold ROC curve, probability of detection is the integral of
the target density over the region to the left of the first vertical line and the
region to the right of the second vertical line in Figure 2.1. Similarly,
the  probability  of  false  alarm  is  the  integral  of  the  non-target  density  over  the 
same  regions. 
Figure 3.6 Bayesian posterior densities of parameters. The two plots at the left show
the  admissible  domains  of  means  and  standard  deviations  for  beta  densities. 
Values  of  standard  deviation  outside  each  “rounded  triangle”  do  not  exist  for 
these  densities.  The  target  and  non-target  densities  shown  at  the  right  map 
to  specific  locations  on  the  standard  deviation  versus  mean  graphs  shown  at 
the  left. 
parameters given a set of samples are obtained. Further, if the target and non-target
samples  are  independent,  a  joint  posterior  weight  is  obtained  for  any  combination  of 
target  and  non-target  densities.  The  application  of  Bayes’  rule  requires  the  specification 
of  prior  parameter  densities.  The  typical  prior  density  has  uniform  distributions  of  mean 
and  standard  deviation  over  their  admissible  domains. 

The  following  discussion  outlines  an  analytical  determination  of  a  ROC  curve  density. 

As is typical for Bayesian evaluations, the analytical results produce integrals that
cannot be evaluated further analytically (see MacKay [MacKay, 1992a], Bishop
[Bishop, 1995], Clyde [Clyde, 1999][Clyde and George, 2004], Hoeting
[Hoeting et al., 1999], and Jordan [Jordan et al., 1999]). However, numerical evaluation
is possible for the beta density model and for more complex density models (such as beta
mixture models).

Throughout the analytical progression that follows, the subscripts on each density (for
example, the subscript $u|d$ on $p_{u|d}$) indicate the quantities being evaluated as
random variables (see the discussion in Section 2.2 regarding the relation of random
variables and parameters).

Let $d = \{s_i \mid i = 1, \ldots, I\}$ be a set of known independent non-target score samples, where
$s_i$ is the $i$th non-target score sample, and let $u$ be the non-target density parameters. For
example, for a beta density model, $u$ may be the $(\mu_n, \sigma_n)$ parameters that are the
allowable means ($\mu_n$) and standard deviations ($\sigma_n$) from the admissible set. Let
$p_{u|d}(u|d)$ be the conditional probability density of the non-target score parameters $u$
given $d$. Then by Bayes’ rule, $p_{u|d}(u|d)$ is

$$p_{u|d}(u|d) = C_0\, p_{d|u}(d|u)\, p_u(u), \tag{3.4}$$

where the constant $C_0$ depends on $d$, $p_{d|u}(d|u)$ is the conditional probability
density of the samples given the parameters, and $p_u(u)$ is the prior probability density of
the parameters.

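For a beta probability density, Equation (3.4) is

$$p_{(\mu_n,\sigma_n)|d}(\mu_n,\sigma_n|d) = C_1\, p_{d|(\mu_n,\sigma_n)}(d|\mu_n,\sigma_n)\, p_{\mu_n,\sigma_n}(\mu_n,\sigma_n), \tag{3.5}$$

where the constant $C_1$ depends on $d$.

By sample independence, the probability density of the samples given the non-target
score parameters, $p_{d|(\mu_n,\sigma_n)}(d|\mu_n,\sigma_n)$, is

$$p_{d|(\mu_n,\sigma_n)}(d|\mu_n,\sigma_n) = C_2 \prod_{i=1}^{I} \frac{\Gamma(a_n + b_n)}{\Gamma(a_n)\,\Gamma(b_n)}\, s_i^{\,a_n - 1}(1 - s_i)^{\,b_n - 1}, \tag{3.6}$$

where the beta exponents are written in terms of the mean and standard deviation as

$$a_n = \mu_n\left[\frac{\mu_n(1-\mu_n)}{\sigma_n^2} - 1\right], \qquad b_n = \mu_n\left[\frac{\mu_n(1-\mu_n)}{\sigma_n^2} - 1\right]\left[\frac{1}{\mu_n} - 1\right],$$

and the constant $C_2$ depends on $d$.
Thus,

$$p_{(\mu_n,\sigma_n)|d}(\mu_n,\sigma_n|d) = C_3 \left\{\prod_{i=1}^{I} \frac{\Gamma(a_n + b_n)}{\Gamma(a_n)\,\Gamma(b_n)}\, s_i^{\,a_n - 1}(1 - s_i)^{\,b_n - 1}\right\} p_{\mu_n,\sigma_n}(\mu_n,\sigma_n), \tag{3.7}$$

where the constant $C_3$ depends on $d$.

If the assumption is made that $p_{\mu_n,\sigma_n}(\mu_n,\sigma_n)$ is uniform over all allowable values of
$\mu_n, \sigma_n$, then

$$p_{(\mu_n,\sigma_n)|d}(\mu_n,\sigma_n|d) = C_4 \prod_{i=1}^{I} \frac{\Gamma(a_n + b_n)}{\Gamma(a_n)\,\Gamma(b_n)}\, s_i^{\,a_n - 1}(1 - s_i)^{\,b_n - 1}, \tag{3.8}$$

where the constant $C_4$ depends on $d$.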
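A minimal numerical sketch of the uniform-prior posterior follows. It evaluates the product of beta density values at a list of (mean, standard deviation) grid points and normalizes the result. The function names are hypothetical, the mean and standard deviation are converted to the usual beta shape parameters, and the sum of log densities is used instead of the raw product only to avoid numerical underflow.

```python
import math

def beta_logpdf(s, a, b):
    """Log of the beta density with shape parameters (a, b) at score s in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(s) + (b - 1.0) * math.log(1.0 - s))

def shape(mu, sigma):
    """Beta shape parameters for a given mean and standard deviation."""
    nu = mu * (1.0 - mu) / sigma**2 - 1.0
    return mu * nu, (1.0 - mu) * nu

def grid_posterior(samples, grid):
    """Normalized posterior weights at each (mu, sigma) grid point, uniform prior."""
    logw = []
    for mu, sigma in grid:
        a, b = shape(mu, sigma)
        logw.append(sum(beta_logpdf(s, a, b) for s in samples))
    top = max(logw)                          # subtract the peak before exponentiating
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return [v / total for v in w]
```

With a uniform prior the normalization plays the role of the constant that depends on $d$; every grid point passed in is assumed to lie inside the admissible set.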

The points $(\mu_n, \sigma_n)$ chosen within the admissible set are used to estimate Bayesian
posterior densities. Each Bayesian posterior density may be visualized as the
three-dimensional function described by Equation (3.8) that is non-zero for any value
within the admissible set. The uniformly spaced points shown in the plots on the left in
Figure 3.6 select the elements of $u$ and $v$ that are evaluated numerically.

Let $h = \{q_j \mid j = 1, \ldots, J\}$ be a set of known independent target score samples, where $q_j$
is the $j$th target score sample, and let $v$ be the target density parameters. For example,
for a beta density model, $v$ may be the $(\mu_t, \sigma_t)$ parameters that are the allowable means
($\mu_t$) and standard deviations ($\sigma_t$) from the admissible set. Then applying the analysis
above yields expressions similar to Equations (3.5) to (3.8), where the expression for
$p_{(\mu_t,\sigma_t)|h}(\mu_t,\sigma_t|h)$ is obtained by replacing $i$ with $j$, $I$ with $J$, $u$ with $v$, $s_i$ with $q_j$, and $(\mu_n, \sigma_n)$ with
$(\mu_t, \sigma_t)$ in Equation (3.8).
Theorem 3.1 Posterior density evaluation for the parameters given the non-target
samples

Let $p_{s|u}(s_i|u_k)$ be the non-target score probability density evaluated at the $i$th non-target
score sample given the $k$th non-target sample parameter, where $u_k$ specifies a vector.
Let $p_{u|d}(u_k|d)$ be the probability density of the non-target sample parameters evaluated at
$u_k$ given the non-target samples $d$, where $d = \{s_i \mid i = 1, \ldots, I\}$. Let $p_u(u_k)$ be the prior
probability density of the non-target sample parameter vector evaluated at $u_k$. Assume
that the non-target samples are independent and identically distributed. Then

$$p_{u|d}(u_k|d) = C_5 \prod_{i=1}^{I} p_{s|u}(s_i|u_k)\, p_u(u_k), \tag{3.9}$$

where the constant $C_5$ depends on $d$.

Proof

By non-target sample independence and identical distribution

$$p_{d|u}(d|u_k) = C_6 \prod_{i=1}^{I} p_{s|u}(s_i|u_k), \tag{3.10}$$

where the constant $C_6$ depends on $d$.
From Bayes’ rule

$$p_{u|d}(u_k|d) = C_7\, p_{d|u}(d|u_k)\, p_u(u_k), \tag{3.11}$$

where the constant $C_7$ depends on $d$.

Therefore, combining Equations (3.10) and (3.11) yields (3.9).

As an example, for a beta density model, $u_k$ specifies a mean and standard deviation. An
expression for $p_{v|h}(v_m|h)$ is developed similarly.

In  Figure  3.6,  the  oval  regions  shown  in  the  vicinity  of  the  target  and  non-target  mean  and 
standard  deviation  values  provide  a  confidence  contour  for  the  posterior  probability  that 
the  given  set  of  samples  is  obtained  from  densities  parameterized  by  the  indicated 
regions.  An  example  of  a  graph  of  Bayesian  posterior  density  is  shown  in  Figure  3.7.  A 
plane  that  intersects  the  graph  of  the  density  such  that  a  selected  percentage  (e.g.,  90%) 
of  the  volume  of  the  density  is  enclosed  defines  a  confidence  contour. 

Definition  -  Confidence  contour  for  the  non-target  parameter  density 

Let $p_{u|d}(u|d)$ be the probability density of the non-target sample parameters given the
non-target samples $d$. Let c.c. be the desired confidence coverage (e.g., if the desired
coverage is 90%, then the confidence contour fraction is 0.90). Let $u$ have elements
$(\mu_n, \sigma_n)$ in the domain of the admissible set. For any $z > 0$, let $N_z$ consist of the set of
all $(\mu_n, \sigma_n)$ where $p_{(\mu_n,\sigma_n)|d}(\mu_n,\sigma_n|d) > z$.

$N_z$ is the set of $u$ (within the admissible set) that provides the desired confidence
coverage (c.c.).

To evaluate numerically, let

$$z_{test} = \max\left(p_{(\mu_n,\sigma_n)|d}(\mu_n,\sigma_n|d)\right). \tag{3.13}$$

Find $N_{z_{test}}$ for $z_{test}$. Then find $c.c._{test}$ for $N_{z_{test}}$. If $c.c._{test} <$ c.c., then let
$z_{test} = z_{test} - \varepsilon$. The value $\varepsilon$ is a selected step size by which the change in the value of
Figure 3.7 Bayesian posterior density of beta density parameters. The posterior density
formed from 300 mean $\mu$ and standard deviation $\sigma$ pairs with respect to a
set of 30 target samples from a beta density of score is shown (a similar plot
applies for 30 non-target samples). The maximum likelihood estimate for
the mean and standard deviation is at the peak of the displayed density.
$z_{test}$ is specified. Repeat the process, continuing to reduce $z_{test}$ until $c.c._{test} =$ c.c. The
confidence contour for the target parameter density is developed similarly.
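The threshold-lowering search for a confidence contour can be sketched on a discrete grid of posterior weights. The example below is an assumption-laden shortcut rather than the stepwise procedure itself: instead of reducing the threshold by a fixed step size, it sorts the grid weights in descending order and accumulates them until the requested coverage is reached, which gives the same highest-density region on a uniformly spaced grid. The function name `confidence_contour` is hypothetical.

```python
def confidence_contour(weights, coverage=0.90):
    """Return (indices of grid points inside the contour, threshold weight).

    weights  -- posterior density values on a uniformly spaced parameter grid
    coverage -- desired confidence coverage, e.g. 0.90
    """
    total = sum(weights)
    # Visit grid points from highest to lowest posterior density.
    order = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    region, mass = [], 0.0
    for k in order:
        region.append(k)
        mass += weights[k] / total
        if mass >= coverage:        # enclosed fraction has reached the target
            break
    return region, weights[region[-1]]
```

On a discrete grid the enclosed fraction generally overshoots the requested coverage slightly, so the stopping test uses "at least" rather than exact equality.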

By  Bayes’  rule  and  assumed  sample  independence,  the  posterior  probability  that  a 
selected  target  mean  and  standard  deviation  are  the  parameters  that  specify  the  true  (or 
underlying)  beta  target  density  given  a  set  of  samples  is  proportional  to  the  product  of  all 
density  values  for  the  samples  multiplied  by  the  prior  probability  density  of  the 
parameters.  This  process  of  evaluating  the  posterior  density  is  repeated  for  a  set  of 
non-target  samples.  Then  the  results  are  multiplied  to  obtain  a  value  proportional  to  the 
probability  that  a  pair  of  target  parameters  and  a  pair  of  non-target  parameters  are  the 
parameters  of  the  underlying  target  and  non-target  densities  of  scores.  The  posterior 
density  in  Figure  3.7  illustrates  Equation  (3.9).  Any  point  within  the  admissible  set  is 
weighted  by 

$$w_k = \prod_{i=1}^{I} p_{s|u}(s_i|u_k)\, p_u(u_k), \tag{3.14}$$

where $w_k$ is the weight for the non-target parameters $u_k$. A similar expression applies for
$w_m$, where $w_m$ is the weight for point $v_m$ and the replacement of $k$ by $m$ indicates target
point $m$.

Let the product $w_k w_m$ be the combined posterior weighting of a target and non-target
density pair (evaluated at $u_k$, $v_m$). From Equation (3.14) for $w_k$ and the similar
expression for $w_m$,

$$w_k w_m = \prod_{i=1}^{I} p_{s|u}(s_i|u_k)\, p_u(u_k) \prod_{j=1}^{J} p_{s|v}(q_j|v_m)\, p_v(v_m). \tag{3.15}$$

From Equation (2.5), $F(t; u_k) = \int_t^{\infty} f(s; u_k)\,ds$, and from Equation (2.6),
$G(t; v_m) = \int_t^{\infty} g(s; v_m)\,ds$. Thus, from Equation (2.10) the ROC curve is

$$r_{k,m}(x; u_k, v_m) = G\!\left(F^{-1}(x; u_k); v_m\right). \tag{3.16}$$
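For a single pair of beta densities, the score-threshold ROC curve can be traced numerically without inverting the non-target distribution: sweeping the threshold from 1 down to 0 and accumulating both tail integrals yields the points (false alarm probability, detection probability) directly. The sketch below uses a midpoint-rule integration; the function names, the number of integration cells, and the use of beta shape parameters as inputs are assumptions of this example.

```python
import math

def beta_pdf(s, a, b):
    """Beta density with shape parameters (a, b); zero outside the open interval (0, 1)."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + (a - 1.0) * math.log(s) + (b - 1.0) * math.log(1.0 - s))

def roc_points(nontarget_ab, target_ab, n=2000):
    """ROC points (x, y) as the threshold sweeps from 1 down to 0.

    x is the non-target tail integral (false alarm probability) and
    y is the target tail integral (correct detection probability).
    """
    h = 1.0 / n
    x = y = 0.0
    pts = [(0.0, 0.0)]
    for i in range(n, 0, -1):
        mid = (i - 0.5) * h                       # midpoint of cell [(i-1)h, ih]
        x += beta_pdf(mid, *nontarget_ab) * h
        y += beta_pdf(mid, *target_ab) * h
        pts.append((min(x, 1.0), min(y, 1.0)))    # clip small integration overshoot
    return pts
```

Each returned point corresponds to one threshold value, so the list is a parametric description of the same curve that the inverse-function form defines.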
Theorem 3.2 ROC curve density

Let $d = \{s_1, \ldots, s_I\}$ be a set of independent and identically distributed samples $s_i$ from
distribution $f$ and let $h = \{q_1, \ldots, q_J\}$ be a set of independent and identically distributed
samples $q_j$ from distribution $g$, where $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$. Let $p_u(u)$ and
$p_v(v)$ be prior densities of the random parameter vectors $u$ and $v$. Let $p_{y|x}(y|x,d,h)$ be
the probability density of correct detection probability $y$ given false alarm probability $x$
and $d$ and $h$. Then

$$p_{y|x}(y|x,d,h) = C_8 \iint_A p_{y|x}(y|x,u,v) \prod_{i=1}^{I} f(s_i|u)\, p_u(u) \prod_{j=1}^{J} g(q_j|v)\, p_v(v)\, du\, dv, \tag{3.17}$$

where the constant $C_8$ depends on $d$ and $h$ and the limits of integration are over the
admissible set $A$.

Proof. See Appendix A-2.

Substituting the beta density parameters and admissible set into Equation (3.17):

$$p_{y|x}(y|x,d,h) = C_9 \iint_A \iint_A p_{y|x}(y|x,u,v) \prod_{i=1}^{I} f(s_i|u) \prod_{j=1}^{J} g(q_j|v)\, p_u(u)\, p_v(v)\, d\sigma_n\, d\mu_n\, d\sigma_t\, d\mu_t, \tag{3.18}$$

where the integration is carried out piecewise over the mean ranges below and above 0.5
for both the non-target and the target parameters, with the standard deviation limits
set by the admissible set, and where

$$p_{y|x}(y|x,u,v) = p_{y|x}(y|x,\mu_n,\sigma_n,\mu_t,\sigma_t), \tag{3.19}$$

$$\prod_{i=1}^{I} f(s_i|u) = \prod_{i=1}^{I} f(s_i|\mu_n,\sigma_n), \tag{3.20}$$

$$\prod_{j=1}^{J} g(q_j|v) = \prod_{j=1}^{J} g(q_j|\mu_t,\sigma_t), \tag{3.21}$$

$$p_u(u) = p_{\mu_n,\sigma_n}(\mu_n,\sigma_n), \tag{3.22}$$

$$p_v(v) = p_{\mu_t,\sigma_t}(\mu_t,\sigma_t), \tag{3.23}$$

and the constant $C_9$ depends on $d$ and $h$.

Lemma 3.1 Discretization of posterior densities

Let $d$ be a set of independent and identically distributed samples $s_i$ of $f$ and let $h$ be a set
of independent and identically distributed samples $q_j$ of $g$, where $i = 1, 2, \ldots, I$ and
$j = 1, 2, \ldots, J$. Let $p_u(u)$ and $p_v(v)$ be prior densities of the parameter vectors $u$ and $v$
with elements $(\mu_n, \sigma_n)$ and $(\mu_t, \sigma_t)$, respectively. Let $u_k$ and $v_m$ be $u$ and $v$ selected
uniformly over the parameter domains within the admissible set. Finally, let

$$\Delta_k = \Delta_n = (\mu_{n,(k+1)} - \mu_{n,k})(\sigma_{n,(k+1)} - \sigma_{n,k})$$

and let

$$\Delta_m = \Delta_t = (\mu_{t,(m+1)} - \mu_{t,m})(\sigma_{t,(m+1)} - \sigma_{t,m}),$$

where the second subscript designates position in the admissible set domain.

Then

$$C_{10} \iint_A \prod_{i=1}^{I} \left[p_{s|\mu_n,\sigma_n}(s_i|\mu_n,\sigma_n)\, p_{\mu_n,\sigma_n}(\mu_n,\sigma_n)\right] d\sigma_n\, d\mu_n = C_{11} \lim_{K \to \infty} \sum_{k=1}^{K} \prod_{i=1}^{I} \left[p_{s|\mu_n,\sigma_n}(s_i|\mu_{n,k},\sigma_{n,k})\, p_{\mu_n,\sigma_n}(\mu_{n,k},\sigma_{n,k})\right] \tag{3.24}$$

and

$$C_{12} \iint_A \prod_{j=1}^{J} \left[p_{s|\mu_t,\sigma_t}(q_j|\mu_t,\sigma_t)\, p_{\mu_t,\sigma_t}(\mu_t,\sigma_t)\right] d\sigma_t\, d\mu_t = C_{13} \lim_{M \to \infty} \sum_{m=1}^{M} \prod_{j=1}^{J} \left[p_{s|\mu_t,\sigma_t}(q_j|\mu_{t,m},\sigma_{t,m})\, p_{\mu_t,\sigma_t}(\mu_{t,m},\sigma_{t,m})\right], \tag{3.25}$$

where the constant $C_{10}$ depends on $d$, the constant $C_{11}$ depends on $C_{10}$ and $\Delta_k$, the
constant $C_{12}$ depends on $h$, the constant $C_{13}$ depends on $C_{12}$ and $\Delta_m$, and the limits
of integration are over the admissible set $A$.

Proof

Since each evaluated $(\mu_{n,k}, \sigma_{n,k})$ is uniformly spaced on the admissible set, $K \propto 1/\Delta_n$
and $M \propto 1/\Delta_t$, the lemma follows by the definition of a double integral as the limit of
a Riemann sum (see [Larson et al., 2002]).
Theorem 3.3 Numerical approximation of ROC curve density

Let $p_{y|x}(y|x,d,h)$ be the density of correct detection probability $y$ given false alarm
probability $x$ and $d$ and $h$. Let $p_{y|x}(y|x,u,v) = \delta(y - r(x; u, v))$, where $\delta$ is the Dirac density or
distribution function, let $p_u(u)$ be the prior density of the non-target parameters, and let $p_v(v)$
be the prior density of the target parameters. Let $\{(u_k, v_m) : k = 1, \ldots, K,\ m = 1, \ldots, M\}$
be uniformly selected over the admissible set of $u$ and $v$ for the target and non-target
parameter densities. Let $p_{s|u}(s_i|u_k)$ be the density of the independent and identically
distributed non-target samples evaluated at the $i$th non-target sample $s_i$ given the $k$th
non-target sample parameters $u_k$, where $u_k$ has elements $(\mu_{n,k}, \sigma_{n,k})$ over the admissible set.
Let $p_{u|d}(u|d)$ be the density of the non-target sample parameters given the non-target
samples $d$. Let $p_u(u_k)$ be the prior density of the non-target sample parameter vector
evaluated at $u_k$, and let $f(s_i|\mu_{n,k},\sigma_{n,k}) = p_{s|(\mu_n,\sigma_n)}(s_i|\mu_{n,k},\sigma_{n,k})$. Let $p_{s|v}(q_j|v_m)$ be the
density of the independent and identically distributed target samples evaluated at the $j$th
target sample $q_j$ given the $m$th target sample parameters $v_m$, where $v_m$ has elements
$(\mu_{t,m}, \sigma_{t,m})$ over the admissible set. Let $p_{v|h}(v|h)$ be the density of the target sample
parameters given the target samples $h$. Let $p_v(v_m)$ be the prior density of the target
sample parameter vector evaluated at $v_m$, and let
$g(q_j|\mu_{t,m},\sigma_{t,m}) = p_{s|(\mu_t,\sigma_t)}(q_j|\mu_{t,m},\sigma_{t,m})$. Finally, let

$$\gamma(d) = \lim_{K \to \infty} \sum_{k=1}^{K} \prod_{i=1}^{I} \left[f(s_i|\mu_{n,k},\sigma_{n,k})\, p_{\mu_n,\sigma_n}(\mu_{n,k},\sigma_{n,k})\right] \tag{3.26}$$

and

$$\gamma'(d) = \iint_A \prod_{i=1}^{I} \left[f(s_i|\mu_n,\sigma_n)\, p_{\mu_n,\sigma_n}(\mu_n,\sigma_n)\right] du, \tag{3.27}$$

where the limits of integration are over the admissible set $A$.

Then

$$p_{y|x}(y|x,d,h) = C_{14} \lim_{\substack{K \to \infty \\ M \to \infty}} \sum_{k=1}^{K} \sum_{m=1}^{M} \delta(y - r(x; u_k, v_m)) \prod_{i=1}^{I} \left[f(s_i|u_k)\, p_u(u_k)\right] \prod_{j=1}^{J} \left[g(q_j|v_m)\, p_v(v_m)\right], \tag{3.28}$$

where the constant $C_{14}$ depends on $K$, $M$, $d$ and $h$.

Proof

Since from Lemma 3.1, $\gamma'(d) \propto \gamma(d)$,

$$\gamma'(d) \iint_A \prod_{j=1}^{J} \left[g(q_j|\mu_t,\sigma_t)\, p_{\mu_t,\sigma_t}(\mu_t,\sigma_t)\right] d\sigma_t\, d\mu_t = C_{15}\, \gamma(d) \sum_{m=1}^{M} \prod_{j=1}^{J} \left[p_{s|(\mu_t,\sigma_t)}(q_j|\mu_{t,m},\sigma_{t,m})\, p_{\mu_t,\sigma_t}(\mu_{t,m},\sigma_{t,m})\right], \tag{3.29}$$

where the constant $C_{15}$ depends on $K$, $M$, $d$ and $h$. Equivalently,

$$\iint_A \iint_A \prod_{i=1}^{I} \left[f(s_i|\mu_n,\sigma_n)\, p_{\mu_n,\sigma_n}(\mu_n,\sigma_n)\right] \prod_{j=1}^{J} \left[g(q_j|\mu_t,\sigma_t)\, p_{\mu_t,\sigma_t}(\mu_t,\sigma_t)\right] d\sigma_n\, d\mu_n\, d\sigma_t\, d\mu_t$$

$$= C_{16} \sum_{k=1}^{K} \sum_{m=1}^{M} \prod_{i=1}^{I} \left[p_{s|(\mu_n,\sigma_n)}(s_i|\mu_{n,k},\sigma_{n,k})\, p_{\mu_n,\sigma_n}(\mu_{n,k},\sigma_{n,k})\right] \prod_{j=1}^{J} \left[p_{s|(\mu_t,\sigma_t)}(q_j|\mu_{t,m},\sigma_{t,m})\, p_{\mu_t,\sigma_t}(\mu_{t,m},\sigma_{t,m})\right], \tag{3.30}$$

where the constant $C_{16}$ depends on $K$, $M$, $d$ and $h$.

Thus,

$$\iint_A \iint_A p_{y|x}(y|x,u,v) \prod_{i=1}^{I} \left[f(s_i|\mu_n,\sigma_n)\, p_{\mu_n,\sigma_n}(\mu_n,\sigma_n)\right] \prod_{j=1}^{J} \left[g(q_j|\mu_t,\sigma_t)\, p_{\mu_t,\sigma_t}(\mu_t,\sigma_t)\right] d\sigma_n\, d\mu_n\, d\sigma_t\, d\mu_t \tag{3.31}$$

$$= C_{17} \sum_{k=1}^{K} \sum_{m=1}^{M} p_{y|x}(y|x,u_k,v_m) \prod_{i=1}^{I} \left[p_{s|(\mu_n,\sigma_n)}(s_i|\mu_{n,k},\sigma_{n,k})\, p_{\mu_n,\sigma_n}(\mu_{n,k},\sigma_{n,k})\right] \prod_{j=1}^{J} \left[p_{s|(\mu_t,\sigma_t)}(q_j|\mu_{t,m},\sigma_{t,m})\, p_{\mu_t,\sigma_t}(\mu_{t,m},\sigma_{t,m})\right], \tag{3.32}$$

where the constant $C_{17}$ depends on $K$, $M$, $d$ and $h$.

The theorem follows upon substituting Equation (3.32) into (3.31) and using Equation
(3.17).


To extend the above theorem to the CEG curve, let the CEG curve be defined as (see
Section 2.3)

$$P(T|s, u_k, v_m) = \frac{g(s|T, v_m)\, P(T)}{g(s|T, v_m)\, P(T) + f(s|N, u_k)\, P(N)}, \tag{3.33}$$

where $s \in [0,1]$. Let $\{(u_k, v_m) : k = 1, \ldots, K,\ m = 1, \ldots, M\}$ be uniformly selected over
the admissible set of $u$ and $v$ for the target and non-target parameter densities. Let $y$
denote a selected location on the vertical axis of the CEG curve (see Figure 2.2 for a
CEG curve plot). Let $P(T|s, u_k, v_m)$ be the probability of target event given score, $u_k$ and
$v_m$, let $g(s|T, v_m)$ be the density of score given target event and $v_m$, let $f(s|N, u_k)$ be the
probability density of score given non-target event and $u_k$, let $P(T)$ be the prior
probability of target event, and let $P(N)$ be the prior probability of non-target event.
Replace $r(x; u_k, v_m)$ by $P(T|s, u_k, v_m)$. Then the probability density of the probability of
target given score for any evaluated score value is

$$p_{P(T|s)}(P(T|s), d, h) = C_{15} \lim_{\substack{K \to \infty \\ M \to \infty}} \sum_{k=1}^{K} \sum_{m=1}^{M} \delta(y - P(T|s; u_k, v_m)) \prod_{i=1}^{I} \left[f(s_i|u_k)\, p_u(u_k)\right] \prod_{j=1}^{J} \left[g(q_j|v_m)\, p_v(v_m)\right], \tag{3.34}$$

where the constant $C_{15}$ depends on $K$, $M$, $d$ and $h$.
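For one selected pair of target and non-target densities, the CEG curve value at a score is a direct application of Bayes’ rule. The sketch below evaluates it for beta densities given by their shape parameters. The function names are hypothetical, and taking the non-target prior as one minus the target prior is an assumption of the example.

```python
import math

def beta_pdf(s, a, b):
    """Beta density with shape parameters (a, b); zero outside the open interval (0, 1)."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + (a - 1.0) * math.log(s) + (b - 1.0) * math.log(1.0 - s))

def ceg_value(s, nontarget_ab, target_ab, p_target=0.5):
    """Probability of target given score s for one (non-target, target) density pair."""
    g = beta_pdf(s, *target_ab) * p_target            # g(s|T) P(T)
    f = beta_pdf(s, *nontarget_ab) * (1.0 - p_target) # f(s|N) P(N), with P(N) = 1 - P(T)
    return g / (g + f)
```

Evaluating `ceg_value` on a grid of scores for every retained parameter pair, and attaching the pair's posterior weight to each resulting curve, gives the weighted set of CEG curves that the density expression sums over.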

Note that covering the entire admissible parameter space volume with a practical number
of grid points becomes computationally more difficult as the number of dimensions
increases (see [Gelman et al., 2004]). For higher dimensions, Monte Carlo methods (see
[Hammersley and Handscomb, 1964], [Kass and Raftery, 1995]) or related
approximation methods may be used (such as Gibbs sampling or the Metropolis
Algorithm; see [Casella and Berger, 2002], [MacKay, 2003]); these methods require
i.i.d. sampling assumptions.

Note that a fundamental assumption for a simple Monte Carlo approach (see
[Hammersley and Handscomb, 1964] and [Kass and Raftery, 1995]) is

$$\int p_s(s|u)\, p_u(u)\, du = C_{16} \lim_{K \to \infty} \sum_{k=1}^{K} p_s(s|u_k)\, p_u(u_k), \tag{3.35}$$

where the constant $C_{16}$ depends on $K$, and the $K$ grid points are independently and
identically selected from the admissible set. Equation (3.35) may replace Equation
(3.24) for i.i.d. sampling rather than uniform grid selection; thus the framework
described here is appropriate for Monte Carlo methods.

Calculating the $w_k w_m$ values and the $r_{k,m}(x; u_k, v_m)$ function is straightforward and
numerically tractable. However, it is desirable to limit the size of $K$ and $M$ by removing
the regions where $w_k w_m$ approaches zero (i.e., select only $u_k$ and $v_m$ values such that $w_k w_m$ is
greater than a given small value). For computational efficiency, an iterative process is
used. The iterative process is described below; Section 5.5 gives a full description of the
numerical evaluation process used for the results shown in Chapters 4 and 5.

Procedure 3.1 Iterative process for calculating weight values

1. Select $K$ evaluation points $k = 1, 2, \ldots, K$ (e.g., $K = 300$) that are uniform over the
admissible set of non-target score parameters $u_k$, where each $u_k$ consists of mean $\mu_{n,k}$
and standard deviation $\sigma_{n,k}$.

2. Select $M$ evaluation points $m = 1, 2, \ldots, M$ (e.g., $M = 300$) that are uniform over the
admissible set of target score parameters $v_m$, where each $v_m$ consists of mean $\mu_{t,m}$ and
standard deviation $\sigma_{t,m}$.

3. Find $w_k = \prod_{i=1}^{I} p_{s|u}(s_i|u_k)\, p_u(u_k)$ for each evaluation point selected in step 1 and for a
given set of $I$ non-target samples $s_i$, $i = 1, 2, \ldots, I$.

4. Find $w_m = \prod_{j=1}^{J} p_{s|v}(q_j|v_m)\, p_v(v_m)$ for each evaluation point selected in step 2 and for
a given set of $J$ target samples $q_j$, $j = 1, 2, \ldots, J$.

5. Combine all $w_k$ and $w_m$ pairs from steps 3 and 4 to find the initial values (e.g., 90,000)
of $w_k w_m$.

6. Find the root mean squared distance to the mean of the parameter values for each
$(\mu_{n,k}, \sigma_{n,k})$ pair, i.e., $\left[\left(\mu_{n,k} - \frac{1}{K}\sum_{k=1}^{K}\mu_{n,k}\right)^2 + \left(\sigma_{n,k} - \frac{1}{K}\sum_{k=1}^{K}\sigma_{n,k}\right)^2\right]^{1/2}$.

7. Repeat step 6 for each $(\mu_{t,m}, \sigma_{t,m})$ pair.

8. Retain a subset of the combinations of the $w_k w_m$ pairs that are closest in distance as
defined by steps 6 and 7 to the mean of non-target and target parameter values,
respectively. Also, retain any additional $w_k$ and $w_m$ pairs without regard to distance
whose $w_k w_m$ value is greater than the lowest $w_k w_m$ value of the subset of pairs that are
closest in distance.

9. Create a new uniform grid of non-target mean and standard deviation values (for example,
10 x 10) that bound the region formed by the pairs retained in steps 6 and 8; the new grid
forms new $(\mu_{n,k}, \sigma_{n,k})$ pairs (for example, 100).

10. Create as in step 9 a new uniform grid of target mean and standard deviation
values (for example, 10 x 10) that bound the region formed by the pairs identified in steps
7 and 8; the new grid forms new $(\mu_{t,m}, \sigma_{t,m})$ pairs (for example, 100).

11. Find the posterior weightings $w_k w_m$ of the new pairs (e.g., 10,000 posterior
weightings).

12. Retain all $(w_k, w_m)$ pairs such that 99.9% of the total posterior parameter weightings
are maintained.

13. Repeat steps 9 through 12, except use the region formed by the pairs identified in step
12 rather than step 9.
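The grid-refinement idea behind the procedure can be sketched for a single set of samples. The example below is a simplified stand-in, not the procedure itself: it handles one parameter pair (mean, standard deviation) rather than the joint non-target and target weights, it skips the distance-based retention of steps 6 to 8, and it assumes a uniform prior and an admissible set defined by both beta shape parameters exceeding 1. The function names and the default settings are hypothetical.

```python
import math

def log_weight(samples, mu, sigma):
    """Log posterior weight (uniform prior) of a beta density with the given mean and std."""
    nu = mu * (1.0 - mu) / sigma**2 - 1.0
    a, b = mu * nu, (1.0 - mu) * nu
    if a <= 1.0 or b <= 1.0:                  # outside the assumed admissible set
        return -math.inf
    c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum(c + (a - 1.0) * math.log(s) + (b - 1.0) * math.log(1.0 - s)
               for s in samples)

def refine(samples, box, n=10, keep=0.999, rounds=3):
    """Shrink an n x n grid onto the region holding `keep` of the posterior weight.

    box = (mu_lo, mu_hi, sigma_lo, sigma_hi); returns the retained grid points
    and their normalized weights from the final round.
    """
    for _ in range(rounds):
        mu_lo, mu_hi, sg_lo, sg_hi = box
        dm, ds = (mu_hi - mu_lo) / n, (sg_hi - sg_lo) / n
        # Cell midpoints of a uniform grid over the current box.
        grid = [(mu_lo + (i + 0.5) * dm, sg_lo + (j + 0.5) * ds)
                for i in range(n) for j in range(n)]
        logw = [log_weight(samples, m, s) for m, s in grid]
        top = max(logw)
        w = [math.exp(v - top) for v in logw]
        total = sum(w)
        w = [v / total for v in w]
        # Keep the highest-weight points until `keep` of the weight is covered.
        kept, mass = [], 0.0
        for k in sorted(range(len(w)), key=lambda k: -w[k]):
            kept.append(k)
            mass += w[k]
            if mass >= keep:
                break
        # New box bounds the cells of the retained points.
        box = (min(grid[k][0] for k in kept) - dm / 2, max(grid[k][0] for k in kept) + dm / 2,
               min(grid[k][1] for k in kept) - ds / 2, max(grid[k][1] for k in kept) + ds / 2)
    return [grid[k] for k in kept], [w[k] for k in kept]
```

A call such as `refine(samples, (0.0, 1.0, 0.0, 0.3))` starts from a box covering the whole mean range; each round concentrates the same number of grid points on a smaller region, which is the computational benefit the procedure is after.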

As  the  number  of  non-target  samples  and  target  samples  increases,  the  probability 
density  shown  in  Figure  3.7  is  more  highly  peaked,  and  the  region  where  the  weights  wk 
and  wm  have  significant  magnitudes  is  smaller. 
Theorem 3.4 True versus possible parameter sets

Let $d = \{s_i : i = 1, \ldots, I\} \subset [0, 1]$ be a set of independent and identically distributed
samples of the density of non-target samples $f(s; u)$. Let $F_I(s)$ be the distribution of
these samples. Let $u_b$ be the true (underlying) parameter values of the non-target density
$f$. Let $u_z$ be a possible parameter from the admissible set $A$, and let

$$c_z = \prod_{i=1}^{I} f(s_i; u_b) - \prod_{i=1}^{I} f(s_i; u_z). \tag{3.36}$$

Then, as $I \to \infty$, $c_z$ increases for all $z \neq b$.

Proof

By definition of independent and identically distributed samples, the distribution of the
samples $F_I(s)$ equals the distribution of the random variable $S$ (see [Papoulis, 1991, pp.
185]) as $I \to \infty$ (see [Stark and Woods, 1986, pp. 252]). Thus, as $I \to \infty$, $c_z$ increases
for all $z \neq b$.

A  similar  result  holds  true  for  the  target  samples.  Further,  since  the  ROC  curve  density 
combines  the  target  and  non-target  posterior  densities  (see  Equation  (3.17)),  the  ROC 
curve  density  also  narrows  (the  ROC  curve  density  evaluated  at  a  given  false  alarm 
probability  approaches  a  dirac  distribution)  as  sample  size  increases. 

Figure  3.8  shows  the  final  step  in  the  generation  of  the  ROC  curve  density.  Based  on  the 
posterior  density  calculations  for  the  target  and  non-target  parameters  (i.e.,  the  mean  and 
standard  deviation  for  a  beta  density),  an  approximation  of  the  ROC  curve  density  is 
developed.  A  selected  target  density  of  score,  a  selected  non-target  density  of  score,  and 
a varying threshold forms a ROC curve and has a weight. Many sets of selections result
in many ROC curves, each with a weight $w_k w_m$. The figure shows curves that represent
$\delta[y - r_{k,m}(x; u_k, v_m)]$ for five selected $k$ and $m$ pairs. The weighted summation of

Figure 3.8 Weighted ROC curves. Based on the posterior density approximations for the
target and non-target parameter values (i.e., the mean and standard deviation
for a beta density), an approximation of the ROC curve density is developed.
The combination of a selected target density of score and a selected non-target
density of score forms a ROC curve and has a weight. Many sets of
selections result in many ROC curves, each with a weight $w_k w_m$. Here
only  five  weighted  ROC  curves  are  shown;  for  a  large  number  of  weighted 
ROC  curves  many  descriptive  statistics  may  be  computed,  such  as  median 
estimates  for  the  ROC  curve,  confidence  intervals  for  the  ROC  curve,  median 
estimates  for  the  AUC  value,  and  confidence  intervals  for  the  AUC  value. 
$\delta[y - r_{k,m}(x; u_k, v_m)]$ for $k$ and $m$ selected from the admissible set is described by
Equation (3.28). Five ROC curves are shown; a much larger number of weighted ROC
curves is needed to represent a ROC probability density model (approximately 10,000
ROC curves are typically employed). As $K$ and $M$ become large, these weighted curves
approximate the analytical ROC surface density. In particular, if a large number of
$\delta[y - r_{k,m}(x; u_k, v_m)]$ functions for selected $k$ and $m$ pairs are each replicated a number
of times proportional to $w_k w_m$, then the set of replicated functions represents the density
of ROC curves (as the preceding theorem indicates). For a large number of weighted
ROC curves, many descriptive statistics may be computed, such as median estimates for
the ROC curve, confidence intervals for the ROC curve, median estimates for the AUC
value, and confidence intervals for the AUC value, as detailed in Chapter 4. This
outcome extends in a straightforward manner to the CEG curve, and Section 4.2.5 applies
the method described here to CEG curves.
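
The construction of weighted ROC curves described above can be sketched numerically. The sketch below is illustrative only and rests on several assumptions that are not taken from this research: the parameters are given directly as beta shape pairs (a, b) rather than as mean and standard deviation, the shape parameters are at least one so that midpoint integration behaves well, and the function names (`beta_pdf`, `tail_probabilities`, `roc_curve`, `auc`, `weighted_roc_curves`) are hypothetical.

```python
import math

def beta_pdf(s, a, b):
    """Beta density on (0, 1) with shape parameters a and b."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(s) + (b - 1) * math.log(1 - s))

def tail_probabilities(a, b, n=500):
    """Return P(score > t) on a uniform threshold grid t = i/n, i = 0..n,
    using midpoint integration of the beta density (assumes a, b >= 1)."""
    h = 1.0 / n
    mass = [beta_pdf((i + 0.5) * h, a, b) * h for i in range(n)]
    total = sum(mass)
    tails = [0.0] * (n + 1)
    running = 0.0
    for i in range(n - 1, -1, -1):   # accumulate from t = 1 down to t = 0
        running += mass[i] / total
        tails[i] = running
    return tails

def roc_curve(target_ab, nontarget_ab, n=500):
    """Return (false alarm, correct detection) pairs, one per threshold."""
    pd = tail_probabilities(target_ab[0], target_ab[1], n)
    pfa = tail_probabilities(nontarget_ab[0], nontarget_ab[1], n)
    return list(zip(pfa, pd))

def auc(curve):
    """Trapezoidal area under a ROC curve given as (x, y) pairs."""
    pts = sorted(curve)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def weighted_roc_curves(target_params, target_w, nontarget_params, nontarget_w, n=500):
    """Pair every target parameter u_k with every non-target parameter v_m.
    Each resulting ROC curve carries the weight w_k * w_m."""
    curves = []
    for u, wk in zip(target_params, target_w):
        for v, wm in zip(nontarget_params, nontarget_w):
            curves.append((wk * wm, roc_curve(u, v, n)))
    return curves
```

Carrying the weight alongside each curve is assumed here to be equivalent to replicating the curve a number of times proportional to $w_k w_m$; it avoids storing duplicate curves.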

The  above  discussion  is  self-contained  in  that  an  analytical  ROC  curve  density  process  is 
developed.  Necessary  inputs  include  non-target  and  target  samples,  specified  density 
models  for  the  target  and  non-target  samples,  and  prior  densities  for  the  parameters  of  the 
models.  The  selection  of  evaluation  points  for  the  prior  densities  enables  a  numerical 
estimate  of  the  ROC  curve  density. 

The  upper  left  plot  of  Figure  3.9  shows  selected  target  parameter  points  (circles)  and  the 
upper  right  plot  shows  example  non-target  parameter  points  (circles).  The  lower  left  plot 
shows  target  densities  (solid  curves)  and  non-target  densities  (dashed  curves)  for  these 
points,  and  the  lower  right  plot  shows  the  ROC  curves  formed  by  combinations  of  these 
curves: of the 64 possible pairs, the 44 with the highest posterior parameter density
are chosen. The plots demonstrate that a slight shift in parameter value impacts
density  shape  and  the  corresponding  ROC  curve.  As  increasing  numbers  of  target  and 
non-target  samples  are  drawn,  the  densities  that  fit  the  samples  well  using  Bayesian 
posterior density evaluation converge. Since a sequence of random variables converges
in distribution as the number of samples becomes large (see Definition 5.5.10 of [Casella
and Berger, 2002, pp. 235]), assuming that the samples are i.i.d., the range of densities that
have high likelihoods (i.e., that fit the samples well) narrows as sample size increases.

Figure 3.9 Parameter variation with corresponding densities and ROC curves. The
upper left plot shows parameter points that select target densities, and the
upper right plot shows parameter points that select non-target densities. The
lower left plot shows target (solid curves) and non-target (dashed curves)
densities for these points, and the lower right plot shows the corresponding
ROC curves.

An  increase  in  sample  size  is  observed  experimentally  to  enable  large  regions  of  standard 
deviations  and  means  to  be  disregarded,  because  the  corresponding  posterior  density 
regions  have  low  magnitude  (see  the  above  theorem). 

Note  that  parameter  evaluation  points  uniformly  spaced  for  one  parameter  choice  may 
not  be  uniformly  spaced  for  other  parameter  choices.  Figure  3.10  plots  points  uniformly 
spaced  over  variance  and  mean  rather  than  standard  deviation  and  mean,  and  then 
converts  these  points  to  standard  deviation  and  mean.  Comparison  with  Figure  3.6  shows 
that these points are now more concentrated at larger standard deviations. Figure 3.11
examines  posterior  probability  density  over  the  beta  density  parameters  a  and  b  rather 
than  mean  and  standard  deviation.  As  a  and  b  increase,  density  width  generally 
decreases,  which  initially  provides  better  fit  to  samples  for  selected  means,  until  a 
maximum  posterior  parameter  weight  is  reached,  beyond  which  the  target  and  non-target 
densities  have  variance  too  small  to  adequately  fit  the  samples.  Thus,  selecting  points 
uniformly  over  a  and  b  requires  different  prior  assumptions  than  selecting  points 
uniformly  over  mean  and  standard  deviation. 
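
The effect of the parameter choice on point spacing can be illustrated with a short sketch. It is not the selection procedure used in this research: the `admissible` check below is only the generic condition under which a beta density with positive shape parameters exists, which may differ from the admissible set of Equation (3.3), and the function names are hypothetical.

```python
import math

def admissible(mean, variance):
    """Generic validity check: a beta density with shape parameters a, b > 0
    exists only when 0 < variance < mean * (1 - mean)."""
    return 0.0 < mean < 1.0 and 0.0 < variance < mean * (1.0 - mean)

def variance_grid_as_std(mean, n_points):
    """Place n_points uniformly in variance inside the admissible range for
    `mean`, then express the same points as standard deviations."""
    limit = mean * (1.0 - mean)
    variances = [limit * (i + 1) / (n_points + 1) for i in range(n_points)]
    return [math.sqrt(v) for v in variances if admissible(mean, v)]

# Points that are evenly spaced in variance are unevenly spaced in
# standard deviation: the gaps shrink as the standard deviation grows.
stds = variance_grid_as_std(mean=0.5, n_points=8)
gaps = [hi - lo for lo, hi in zip(stds, stds[1:])]
```

Because the square root is concave, the gaps between successive standard deviations decrease, which is the concentration at larger standard deviations noted above.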

In  this  chapter,  performance  metric  probability  densities  have  been  developed;  Chapter  4 
leverages  these  densities  to  obtain  and  verify  confidence  intervals  and  other  descriptive 
statistics. 
Figure 3.10 Uniformly spaced parameter selection over variance and mean compared
with selection over standard deviation and mean. The curves in both plots
enclose allowed beta density parameters. The points that are uniformly
spaced in variance and mean are transferred to standard deviation versus
mean in the lower plot. Note that while the curves are of different shape,
the limits of $\sigma$ and $\sigma^2$ are both defined by the admissible set of Equation
(3.3) (the difference in shape is simply a result of using a vertical axis of $\sigma$
rather than $\sigma^2$).
Figure 3.11 Beta posterior parameter densities that compare $a$ and $b$ versus $\sigma$ and $\mu$
parameters. The bottom plot is as in Figure 3.7 but for a different set of
target and non-target samples. The top plot shows that as $a$ and $b$ increase,
the density width generally decreases, which initially provides better fit to
samples for selected means, until a maximum posterior parameter weight
is reached (here at $a = 55$, $b = 15$), beyond which the target and non-target
densities have variance too small to adequately fit the samples.
4. Probability Density Characterization and Verification


The method of Chapter 3 generates densities for detection system performance metric
curves,  such  as  the  ROC  curve.  Various  descriptive  statistics  then  characterize  these 
densities;  examples  of  such  statistics  are  confidence  contours  for  the  ROC  curve  and 
confidence  interval  limits  for  the  AUC  value.  Following  the  development  of  such 
characterization  methods,  a  Monte  Carlo  approach  estimates  their  accuracy  using  various 
examples.  Coverage  accuracy  and  alpha  are  used  to  test  whether  or  not  the  defined 
confidence  interval  limits  are  accurate  over  a  large  number  of  trials.  For  example, 
suppose  that  30  target  samples  and  30  non-target  samples  generate  a  ROC  curve.  Then, 
based  on  only  these  60  samples,  a  ROC  curve  probability  density  and  90%  confidence 
intervals  can  be  developed.  The  90%  confidence  intervals  are  intended  to  enclose  the  true 
ROC  curve  90%  of  the  time.  This  outcome  can  be  tested  by  generating  30  target  samples 
and  30  non-target  samples  many  times,  producing  confidence  intervals  for  each  run,  and 
calculating  the  percentage  of  runs  in  which  the  confidence  intervals  enclose  truth.  The 
coverage  accuracy  and  alpha  metrics  are  of  particular  interest  because  they  provide 
quantitative  means  to  compare  the  method  developed  here  with  methods  in  the  literature. 

4.1  Development  of  descriptive  statistics 

4.1.1 The AUC value densities and confidence intervals. The following process maps
the weighted ROC curves shown in Figure 3.8 to AUC value uncertainty. Recall that if
the target and non-target density parameters $u_k$ and $v_m$ are specified as described in
Equation (3.16), then a deterministic ROC curve results. Further, a representative set of
$(k, m)$ pairs results in a representative set of ROC curves. Chapter 3 describes a process
for generating such ROC curves (see Figure 3.8). First, find the ROC curve $r(x; u_k, v_m)$
for each selected $(k, m)$ pair, where $k$ and $m$ identify one of the $K$ parameters $u_k$ and one
of the $M$ parameters $v_m$. Second, replicate each curve a number of times proportional to
its posterior parameter weighting $w_k w_m$, which is defined in Equation (3.15). Finally,
calculate for each ROC curve a corresponding AUC value as

$$AUC(u_k, v_m) = \int_0^1 r(x; u_k, v_m)\,dx, \tag{4.1}$$

where $y$ and $x$ are correct detection probability and false alarm probability, respectively,
of the ROC curve; that is, $y = r(x; u_k, v_m)$ is the ROC function.


Confidence intervals for the AUC values are developed as follows. Center an impulse
probability density function at each observed AUC value. Add and normalize all impulse
functions such that the result is a probability density. Denote this density $p_z(z)$, where $z$
is the domain of possible AUC values. Begin at an AUC test value of 0 and increase until
the AUC test value is found such that the integral of $p_z(z)$ from 0 to the AUC test value is
0.05. This test value is a lower 90% AUC confidence interval. Similarly, begin at an
AUC test value of 1 and decrease until the integral of $p_z(z)$ from the AUC test value to 1
is 0.05. This test value is an upper 90% AUC confidence interval. The following
equations describe the process:

$$\int_0^{\text{test value}_{lower}} p_z(z)\,dz = 0.05, \qquad \int_{\text{test value}_{upper}}^{1} p_z(z)\,dz = 0.05. \tag{4.2}$$


In practice, impulse functions cannot be evaluated numerically.
Instead, compute the lower AUC confidence interval by starting at an AUC test value of
zero and stopping when 5% of the observed values are obtained, thereby approximating
the inclusion of 5% of the total impulse functions that are used to form $p_z(z)$. Proceed
similarly for the upper AUC confidence interval.
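
The counting procedure above can be sketched in a few lines. The sketch is illustrative only: the function names `tail_limit`, `auc_interval`, and `auc_median` are hypothetical, and each AUC value carries its weight $w_k w_m$ directly rather than being replicated, which is assumed to be equivalent up to the rounding that replication introduces.

```python
def tail_limit(values, weights, tail, from_top=False):
    """Scan the sorted values (ascending, or descending when from_top is
    True) and return the first value at which the accumulated normalized
    weight reaches `tail`."""
    pairs = sorted(zip(values, weights), reverse=from_top)
    total = sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight / total
        if running >= tail - 1e-12:   # small tolerance for rounding error
            return value
    return pairs[-1][0]

def auc_interval(aucs, weights, tail=0.05):
    """Two-tail equal area limits: `tail` of the weight lies below the
    lower limit and `tail` of the weight lies above the upper limit."""
    lower = tail_limit(aucs, weights, tail)
    upper = tail_limit(aucs, weights, tail, from_top=True)
    return lower, upper

def auc_median(aucs, weights):
    """AUC test value at which half of the weight has been accumulated."""
    return tail_limit(aucs, weights, 0.5)
```

With `tail=0.05` the pair returned by `auc_interval` corresponds to the lower and upper 90% AUC confidence intervals described above.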

Note  that  a  two-tail  equal  area  approach  is  described  here.  Other  approaches  considered 
by  Ross  [Ross,  2003]  describe  alternative  confidence  interval  definitions.  Note  also  that 
a  median  ROC  curve  is  generated  by  beginning  at  an  AUC  test  value  of  0,  increasing  the 
test  value  until  the  integral  over  the  AUC  value  density  from  0  to  the  test  value  is  0.5,  and 
specifying the ROC curve that corresponds to the test value as the median ROC curve
ranked  by  AUC  value.  Note  finally  that  the  AUC  value  density  is  not  typically 
symmetric,  making  a  normal  approximation  approach  in  lieu  of  the  above  computation 
undesirable. 

Figure  4.1  shows  a  histogram  of  AUC  values,  where  each  AUC  value  is  weighted  by  its 
ROC  curve  weight  as  indicated  in  Figure  3.8.  This  histogram  estimates  the  AUC  value 
density  given  a  set  of  target  samples  and  non-target  samples,  assumed  forms  for  the 
densities  of  score,  assumed  prior  parameter  densities,  and  specified  sampling  protocols. 

A  method  that  generates  a  ROC  curve  90%  confidence  band  from  AUC  value  densities  is 
described  in  Section  4.1.2.  Another  method  that  generates  a  ROC  curve  90%  confidence 
band  from  the  weighted  ROC  curve  density  without  use  of  AUC  values  is  described  in 
Section  4.1.3. 

4.1.2  Rank  characterization  of  ROC  curves  by  AUC  values.  The  ROC  curve 
confidence  contours  shown  in  Figure  4.2  are  obtained  as  follows.  First,  find  the  lower 
and  upper  90%  confidence  intervals  for  AUC  value  (as  explained  in  the  previous  section). 
Next  find  the  ROC  curve  closest  to  the  lower  90%  AUC  confidence  interval  test  value 
(see  Equation  (4.2)),  and  the  ROC  curve  closest  to  the  upper  90%  AUC  confidence 
interval  test  value.  These  two  ROC  curves  form  the  lower  and  upper  limits  of  a  90% 
confidence  band.  For  the  median  or  50%  ROC  curve,  find  the  median  AUC  value,  and 
then  find  the  ROC  curve  that  has  an  AUC  value  closest  to  this  median  value. 
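
A minimal sketch of this selection step follows. It assumes that the AUC value of every candidate ROC curve and the three AUC test values (lower, median, upper) have already been computed; the names `closest_curve` and `rank_band` are hypothetical.

```python
def closest_curve(curves, aucs, target_auc):
    """Return the curve whose AUC value is nearest to target_auc."""
    index = min(range(len(aucs)), key=lambda i: abs(aucs[i] - target_auc))
    return curves[index]

def rank_band(curves, aucs, lower_auc, median_auc, upper_auc):
    """Select the ROC curves that form the lower limit, the median, and the
    upper limit of the confidence band ranked by AUC value."""
    return {
        "lower": closest_curve(curves, aucs, lower_auc),
        "median": closest_curve(curves, aucs, median_auc),
        "upper": closest_curve(curves, aucs, upper_auc),
    }
```

Because each limit is one of the candidate curves, the band limits selected this way are themselves ROC curves.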

Figure 4.2 provides no new information beyond that given by the ROC curve density of
Figure 3.8, and, in fact, Figure 4.2, unlike Figure 3.8, does not indicate the shape of the
ROC curve density (Figure 3.8 provides the entire ROC curve density; in contrast, Figure
4.2 only provides confidence intervals that constitute a summary or partial description of
this full ROC curve density). However, the ROC curve confidence intervals and the
median ROC curve shown in Figure 4.2 are useful. For example, for a selected false
alarm probability, the 90% confidence bands of correct detection probability for two
SUTs can be compared. In particular, if one SUT has a median ROC curve which has
greater correct detection probability at the selected false alarm probability than a second
SUT, and if the confidence intervals of both SUTs at this false alarm probability do not
overlap, then the first SUT is more desirable than the second with at least 90%
confidence. The confidence interval at false alarm probabilities approaching zero or one
necessarily becomes narrow, because a ROC curve by definition has correct detection
probability of zero at false alarm probability of zero and correct detection probability of
one at false alarm probability of one. In particular, in Equations (2.2) and (2.3) for
correct detection and false alarm probability, respectively, let $t = -\infty$ (or in the case of
$s \in [0, 1]$, let $t = 0$). Then correct detection probability equals one and false alarm
probability equals one. Let $t = \infty$ (or in the case of $s \in [0, 1]$, let $t = 1$). Then correct
detection probability equals zero and false alarm probability equals zero.

Figure 4.1 An AUC value histogram. This histogram is based on 30 target and 30
non-target samples. After the replication of each representative ROC curve (as
in Figure 3.8) a number of times proportional to its weight, an AUC value
is calculated for each curve. For this example the underlying densities are
known (but not used in the histogram development), and the true AUC value
is 0.882. The AUC value is a single summary metric used to compare different
SUTs, and here an extension is made to a density estimate in the form of
a histogram of AUC values.

Figure 4.2 Rank characterization for ROC curves weighted by AUC values. Once the
ROC curve density is developed, there are many possible definitions of ROC
curve confidence bands or ROC curve confidence interval contours. The 90%
ROC curve confidence interval contours shown here are obtained by finding
the ROC curve that has the AUC value closest to the lower 90% AUC value
confidence bound and the ROC curve that has the AUC value closest to the
upper 90% AUC value confidence bound. The median ROC curve is the
ROC curve that has the AUC value closest to the median (50%) AUC value,
and the true ROC curve (the ROC curve for the target and non-target densities
from which the samples are drawn) is also shown.

The  confidence  band  method  that  Figure  4.2  illustrates  compares  favorably  with  a 
confidence  band  formed  by  a  pair  of  error  bar  contours,  where  such  contours  are  based  on 
the  standard  deviation  of  the  ROC  curve  density  at  a  given  false  alarm  probability.  Such 
error  bars  may  extend  outside  the  zero  to  one  range  of  correct  detection  probability  and 
do  not  make  appropriate  allowances  for  skewed  distributions.  Methods  in  the  recent 
literature  that  go  beyond  simple  error  bars  (such  as  [Zhou  and  Qin,  2005])  may  also 
extend  beyond  allowed  regions,  e.g.,  to  correct  detection  probabilities  greater  than  one. 
Two  advantages  of  the  ROC  curve  confidence  bands  described  in  this  section  are  that 
they  do  not  require  the  selection  of  an  independent  variable  (such  as  false  alarm 
probability),  and  the  confidence  bands  generated  are  true  ROC  curves. 

Once  a  density  of  ROC  curves  is  developed,  there  are  many  possible  definitions  of  ROC 
curve  confidence  intervals  or  confidence  interval  bands  (in  addition  to  many  ways  to 
compute  these  intervals  or  bands).  Methods  described  in  the  literature  typically  are 
applicable  to  only  one  or  a  small  subset  of  these  definitions.  In  contrast,  the  approach 
taken here of forming ROC curve densities first and then transitioning to descriptive
statistics  can  handle  a  variety  of  definitions.  Ma  [Ma  and  Hall,  1993]  emphasizes  the 
need  for  approaches  that  may  be  applied  to  multiple  confidence  definitions.  The  next 
section  details  the  primary  method  of  confidence  interval  estimation  used  in  this  research. 


4.1.3  Characterization  of  ROC  curve  density.  With  false  alarm  probability  as  the 
independent  variable,  the  following  procedure  generates  a  ROC  curve  density 
characterization.  First,  find  the  density  of  correct  detection  probability  at  a  selected  false 
alarm  probability.  Second,  repeat  for  all  possible  false  alarm  probabilities.  Finally, 
generate  a  normalized  combination  of  all  such  densities  to  form  a  ROC  probability 
density. 

The density of correct detection probability at a given false alarm probability is found as
discussed in Section 3.2, where each ROC curve is replicated a number of times
proportional to the posterior parameter weighting $w_k w_m$, given by Equation (3.15). Let
$N_{w,roc}$ equal the number of replicated ROC curves. Note that each ROC curve gives
one correct detection probability value at any selected false alarm probability. A density
of correct detection probability may be generated by using each of the $N_{w,roc}$ correct
detection probabilities as observations of some unknown density and by estimating the
density of correct detection probability based on these observations. The upper plot of
Figure 4.3 shows such an estimate based on a beta density model, and the lower plot shows
contours of equal density. Figure 4.4 shows similar plots for a true ROC curve with a
lower AUC value.
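
One way to form the beta density estimate at a single false alarm probability is to match moments, as the caption of Figure 4.3 indicates. The sketch below is a minimal illustration under stated assumptions: the correct detection probabilities are supplied with their weights $w_k w_m$ instead of being replicated, and the names `weighted_moments` and `matched_beta` are hypothetical.

```python
def weighted_moments(values, weights):
    """Weighted mean and variance of the correct detection probabilities
    observed at one false alarm probability."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var

def matched_beta(values, weights):
    """Shape parameters (a, b) of the beta density that has the same mean
    and variance as the weighted observations."""
    mean, var = weighted_moments(values, weights)
    if var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("no beta density has these moments")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

Repeating `matched_beta` over a grid of false alarm probabilities yields one smoothed density of correct detection probability per grid point.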

The  ROC  curve  density  developed  here  specifies  false  alarm  probability  as  the 
independent  variable.  However,  it  is  also  acceptable  (although  not  as  consistent  with 
common  practice)  to  select  correct  detection  probability  as  the  independent  axis  and  to 
find  the  density  of  false  alarm  probability  at  every  correct  detection  probability. 
Figure 4.3 A ROC curve density. The upper plot estimates the ROC curve density
formed from 30 target scores and 30 non-target scores. Correct detection
probability is normalized so that for each false alarm probability the integral
of correct detection probability is one. The resulting correct detection
density at each selected false alarm probability is smoothed by a beta density
that has the same mean and variance as the correct detection probabilities of
the replicated ROC curves. The lower plot shows equal density contours.

Figure 4.4 A ROC curve density. This figure is similar to Figure 4.3, except that here
the set of 30 target scores and 30 non-target scores are selected from different
underlying target and non-target densities. These densities are such that the
true ROC curve has a lower AUC value than is the case in Figure 4.3.


4.1.4 Confidence contours for ROC curve density. The ROC curve density developed
in Chapter 3 permits computation of confidence contours. Consider the $N_{w,roc}$ correct
detection probabilities at a specified false alarm probability. Create a density based on
these $N_{w,roc}$ values by centering an impulse (or delta function) density at each of the
correct detection probabilities, and normalize the combination of all $N_{w,roc}$ impulses so
that they form a probability density. Start at a correct detection probability of zero, and
increase it until 5% of the correct detection density is enclosed. The correct detection
probability where this result occurs is a 90% lower confidence interval. Similarly, start at
a correct detection probability of one and decrease it until 5% of the correct detection
density is enclosed to find a 90% upper confidence interval. Repeat for all false alarm
probabilities. The locus of all 90% lower confidence intervals specifies a 90%
lower confidence contour, and the locus of all 90% upper confidence intervals specifies a
90% upper confidence contour. The two contours enclose a 90% confidence band, and
are shown in the upper and lower plots of Figure 4.5. The upper plot uses 10 target
samples and 10 non-target samples as inputs, and the lower plot uses 30 target samples
and 30 non-target samples as inputs (these samples are similar to those shown in Figure
4.3).
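
The contour construction above can be sketched as follows. The sketch assumes that every ROC curve has been evaluated on the same grid of false alarm probabilities and that each curve carries its weight $w_k w_m$ rather than being replicated; the names `tail_limit` and `confidence_band` are hypothetical.

```python
def tail_limit(values, weights, tail, from_top=False):
    """Return the first value, scanning upward (or downward when from_top
    is True), at which the accumulated normalized weight reaches `tail`."""
    pairs = sorted(zip(values, weights), reverse=from_top)
    total = sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight / total
        if running >= tail - 1e-12:   # small tolerance for rounding error
            return value
    return pairs[-1][0]

def confidence_band(curves, weights, tail=0.05):
    """curves[r][i] is the correct detection probability of curve r at the
    i-th false alarm grid point. Returns the lower, median, and upper
    contours, each as a list over the false alarm grid."""
    lower, median, upper = [], [], []
    for i in range(len(curves[0])):
        column = [curve[i] for curve in curves]
        lower.append(tail_limit(column, weights, tail))
        median.append(tail_limit(column, weights, 0.5))
        upper.append(tail_limit(column, weights, tail, from_top=True))
    return lower, median, upper
```

Unlike the rank characterization of Section 4.1.2, the contours produced this way are assembled point by point and need not coincide with any single candidate ROC curve.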

The contours are expressed as follows. Let $p_{y|x}(y|x, d, h)$ denote the ROC density. Then
the 90% confidence interval limits for $y$ at a particular $x$, or $x_i$, for a set of target samples ($d$) and
non-target samples ($h$) are found using

$$\int_0^{CI_{lower}(m;\, x_i, d, h)} p_{y|(x,d,h)}(y|x_i, d, h)\,dy = m \tag{4.3}$$

$$\int_{CI_{upper}(m;\, x_i, d, h)}^{1} p_{y|(x,d,h)}(y|x_i, d, h)\,dy = m \tag{4.4}$$

and solving for $CI_{lower}(0.05; x_i, d, h)$ and $CI_{upper}(0.05; x_i, d, h)$.
Figure 4.5 Confidence intervals with false alarm probability as the independent
variable for two sample sizes. A 90% lower confidence interval is developed
from the ROC curve density by fixing a false alarm probability, starting at
a correct detection probability of zero, and increasing the correct detection
probability until 5% of the density area is encompassed. Similarly, a 90%
upper confidence interval is developed by fixing a false alarm probability,
starting at a correct detection probability of one, and decreasing the correct
detection probability until 5% of the total correct detection probability is
encompassed. The median contour (i.e., the locus of points that encompass
50% of correct detection probability) and the true ROC curve (for the target
and non-target densities from which the samples are drawn) are also shown.
In the upper plot 10 samples of target and 10 samples of non-target are used,
and in the lower plot 30 samples of target and 30 samples of non-target are
used.
Figure 4.5 shows the general effect on confidence interval of an increase in sample size.
The confidence interval widths become smaller as the number of samples increases. The
factors that regulate ROC curve density (and confidence interval) widths are
$\prod_i f(s_i; u) \prod_j g(s_j; v)$, where $f$ is the non-target density, $g$ is the target density, $s_i$ are the
non-target samples, $s_j$ are the target samples, and $u$ and $v$ are the specified parameters
that define $f$ and $g$. As the number of target and non-target samples increases, the range
of densities with a high weight value as defined by these functions decreases.
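
The narrowing of the high-weight region can be illustrated with a small numerical sketch. It makes assumptions that are not part of this research: a flat prior over a coarse grid of beta shape parameters, a single class of samples, and the inverse sum of squared weights as a rough count of how many grid points carry appreciable weight. The function names are hypothetical.

```python
import math
import random

def beta_logpdf(s, a, b):
    """Log of the beta density with shape parameters a and b at 0 < s < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(s) + (b - 1) * math.log(1 - s)

def posterior_weights(samples, params):
    """Normalized product of densities over the samples for each candidate
    (a, b); this is the posterior weight under an assumed flat prior."""
    logs = [sum(beta_logpdf(s, a, b) for s in samples) for a, b in params]
    peak = max(logs)                  # subtract the peak to avoid underflow
    raw = [math.exp(value - peak) for value in logs]
    total = sum(raw)
    return [r / total for r in raw]

def effective_count(weights):
    """Rough number of grid points that carry appreciable weight."""
    return 1.0 / sum(w * w for w in weights)

random.seed(0)
grid = [(a, b) for a in range(1, 11) for b in range(1, 7)]
few = [random.betavariate(5, 2) for _ in range(5)]
many = [random.betavariate(5, 2) for _ in range(400)]
count_few = effective_count(posterior_weights(few, grid))
count_many = effective_count(posterior_weights(many, grid))
```

Comparing `count_few` with `count_many` shows the weight concentrating on fewer candidate densities as the number of samples grows.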

The  density-based  ROC  curve  confidence  interval  generation  method  developed  here 
constitutes  an  improvement  over  other  methods  described  in  the  literature,  as  the 
intervals  here  have  more  useful  definitions.  Many  existing  methods  attempt  to  describe 
the  uncertainty  in  probability  of  correct  detection  y  at  a  specific  probability  of  false  alarm 
x ,  but  do  not  permit  extrapolation  to  confidence  bands  because  they  either  fail  to 
incorporate  or  incorporate  conservatively  the  underlying  uncertainty  in  the  variable  x. 

The  non-target  density  yields  this  uncertainty  as  a  simple  outcome  of  the  Bayesian 
approach  in  the  method  developed  here.  Other  existing  methods  incorporate  uncertainty 
in  both  y  and  x,  but  restrict  threshold  to  a  single  value  or  make  assumptions  that  are  only 
valid  for  particular  density  forms  (see  [Linnet,  1987],  [Campbell,  1994],  and 
[Platt  et  al.,  2000]).  In  the  method  described  here,  threshold  is  eliminated  as  a  variable, 
which  removes  the  need  to  restrict  threshold  to  a  single  value  and  retains  uncertainty  in 
the  independent  variable  x. 

A confidence accuracy measure designated alpha tests ROC curve confidence interval
accuracy. Alpha describes the fraction of trials where the confidence interval does not
enclose truth. One set of target samples and non-target samples defines one trial, a second
set of target samples and non-target samples defines a second trial, etc. An ideal alpha is
one minus the intended confidence interval coverage. The example in Figure 4.5 claims
90% confidence intervals, and thus the ideal alpha is 0.1. If the underlying target and
non-target densities generate the same number of target and non-target samples an
infinite set of times, the truth ideally departs from the confidence interval 10% of the
time. This confidence region accuracy evaluation process extends to a confidence band,
where contours defined by the confidence intervals at every false alarm probability define
the band. The process here assumes that each false alarm probability has an equal
contribution to an overall alpha measure. For example, if the true ROC curve lies outside
the generated confidence band for 25% of the false alarm probabilities for each of an
infinite set of ROC curve estimates, then alpha is 0.25. An alternative approach declares
"failure" if any portion of the true ROC curve lies outside of the confidence
band for any false alarm probability for a particular run. With this alternative approach,
if any portion of the true curve deviates from the ROC curve confidence band on 40% of
an infinite set of generated ROC curve confidence bands, then alpha is 0.40.
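
The two alpha definitions can be compared with a short sketch. It assumes a finite number of Monte Carlo trials in place of the infinite set described above, and a common grid of false alarm probabilities for the true ROC curve and for every confidence band; the function names are hypothetical.

```python
def outside_flags(truth, lower, upper):
    """For one trial, flag each false alarm grid point at which the true
    ROC curve lies outside the confidence band."""
    return [t < lo or t > hi for t, lo, hi in zip(truth, lower, upper)]

def alpha_pointwise(outside):
    """outside[t][i] is True when truth lies outside the band at grid
    point i in trial t. Every false alarm probability contributes equally."""
    flags = [flag for trial in outside for flag in trial]
    return sum(flags) / len(flags)

def alpha_any_failure(outside):
    """Alternative definition: a trial fails if truth leaves the band at
    any false alarm probability."""
    return sum(any(trial) for trial in outside) / len(outside)
```

The second definition can never give a smaller alpha than the first, because a trial with any flagged grid point counts as a complete failure.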

Confidence  interval  accuracy  does  not  necessarily  increase  with  increase  in  sample  size. 
Consider  two  extreme  cases.  First,  evaluate  a  ROC  curve  estimate  with  infinitely  small 
confidence  interval  widths  that  are  ideally  90%  confidence  intervals.  The  ROC  curve 
estimate  may  be  close  to  truth,  but  the  confidence  band  is  always  above  or  below  the  true 
ROC  curve,  resulting  in  an  average  alpha  of  1.  Next,  consider  a  ROC  curve  estimate  far 
from  truth,  but  which  has  the  largest  possible  confidence  interval  widths.  For  example,  at 
every  false  alarm  probability,  the  90%  confidence  interval  limits  are  0  and  1,  in  which 
case  alpha  is  0.  In  a  related  consideration,  note  that  a  confidence  interval  calculation 
approach  that  produces  an  alpha  of  0.1  (for  claimed  90%  confidence  intervals)  is 
generally  better  than  an  approach  that  produces  an  alpha  of  0. 

Let $r_{true}(x)$ be the true ROC curve (in test cases where the density that generates the
target and non-target samples is known), let $ca(m, x)$ be the actual coverage accuracy
defined by Equation (4.5), and let $CI_{lower}(m; x)$ and $CI_{upper}(m; x)$ be as defined by
Equations (4.3) and (4.4). Then

$$ca(m, x) = P\{CI_{lower}(m; x) < r_{true}(x) < CI_{upper}(m; x)\}. \tag{4.5}$$

Estimates for $P(r_{true}(x) > CI_{lower}(m; x))$ and $P(r_{true}(x) < CI_{upper}(m; x))$ may be
found by generating many sets of identical numbers of samples from the same target and
non-target score densities, to approximate the probabilities noted in Equation (4.5). In
particular, let $ca_{desired}$ be the desired confidence interval coverage (in the case of 90%
confidence intervals, $ca_{desired} = 0.90$), and let $\alpha_d$, alpha desired, be one minus the
desired confidence interval coverage. Then


alpha(m ) 


1  - 


ca(m,  x)  —  ad]dx. 


(4.6) 


4.1.5 Relations of confidence intervals to Chebyshev’s inequality. Three separate
relations of Chebyshev’s inequality to confidence intervals follow.

The first relation is established in Theorem 4.1 and shows that the upper and lower
bounds of the confidence interval contours developed in Section 4.1.4 are within the
constraints established by Chebyshev’s inequality.

Theorem 4.1 Upper and lower bounds for confidence interval contours

Let $p_{y|x}(y|x)$ be as developed in Theorem 3.2. The median (see [DeGroot and Schervish,
2002, pp. 210]) of $p_{y|x}(y|x)$ is the value $med_{y|x}$ such that

$$\int_0^{med_{y|x}} p_{y|x}(\eta|x)\,d\eta = \int_{med_{y|x}}^{1} p_{y|x}(\eta|x)\,d\eta = 0.5. \tag{4.7}$$

Let $p_{up_{y|x}}(y|x)$ and $p_{low_{y|x}}(y|x)$ be symmetric probability densities such that

$$p_{up_{y|x}}(y|x) = \begin{cases} p_{y|x}(y|x) & \forall y > med_{y|x} \\ p_{y|x}((2\,med_{y|x} - y)|x) & \forall y < med_{y|x} \end{cases} \tag{4.8}$$

and

$$p_{low_{y|x}}(y|x) = \begin{cases} p_{y|x}(y|x) & \forall y < med_{y|x} \\ p_{y|x}((2\,med_{y|x} - y)|x) & \forall y > med_{y|x} \end{cases} \tag{4.9}$$

Also, let $\mu_{p_{up_{y|x}}}(y|x)$ = mean of $p_{up_{y|x}}(y|x)$, $\mu_{p_{low_{y|x}}}(y|x)$ = mean of $p_{low_{y|x}}(y|x)$,
$\sigma_{p_{up_{y|x}}}(y|x)$ = standard deviation of $p_{up_{y|x}}(y|x)$, and $\sigma_{p_{low_{y|x}}}(y|x)$ = standard deviation of
$p_{low_{y|x}}(y|x)$. Finally, let $r_u(x)$ denote the upper bound on the $(1 - alpha)$ upper
confidence interval of $p_{y|x}(y|x)$ and let $r_l(x)$ denote the lower bound on the $(1 - alpha)$
lower confidence interval of $p_{y|x}(y|x)$.



Then 


ru(x)  <  medy\x  + 


(4.10) 


n{x)  >  medy\x  -  (aPh 


(4.11) 


Proof 

By  Chebyshev’s  inequality  (see  [Hogg  and  Craig,  1978,  pp.  59]),  for  k  >  0 


P(ru(x)  -  pPup^(y\x)  >  k(jPuPy]x{y\x ))  >  1 


1 

k 2 


(4.12) 


Thus 


P(ru(x )  >  kaPuPyix(y\x)  +  pPup^(y\x))  >  1 


1 

k2 


(4.13) 
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An  upper  bound  on  the  (1  -alpha)  upper  confidence  interval  specifies  that 

P(ru(x)  >  kaPuPyix(y\x)  +  fipup^(y\x))  >  1  -  ^ 

and  thus  1  -  £  =  1  -  k  = 

Here  pup  .  (y|x)  is  symmetric  in  y,  y  (y\x)  =  medy\x,  and  by  definition,  r„(. 

m/I® 

denotes  the  (1  -  alpha )  upper  confidence  interval. 

Thus  ru(x)  <  medy\x  +  (a^ 

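Similarly, by Chebyshev's inequality (see [Hogg and Craig, 1978, pp. 59]),

$$P\left(\mu_{p_{low_{y|x}}}(y|x) - r_l(x) > k\sigma_{p_{low_{y|x}}}(y|x)\right) \ge 1 - \frac{1}{k^2}, \tag{4.14}$$

$$P\left(-r_l(x) > k\sigma_{p_{low_{y|x}}}(y|x) - \mu_{p_{low_{y|x}}}(y|x)\right) \ge 1 - \frac{1}{k^2}, \tag{4.15}$$

and

$$P\left(r_l(x) < \mu_{p_{low_{y|x}}}(y|x) - k\sigma_{p_{low_{y|x}}}(y|x)\right) \ge 1 - \frac{1}{k^2}. \tag{4.16}$$

A lower bound on the (1 - alpha) lower confidence interval specifies that

$$P\left(r_l(x) < \mu_{p_{low_{y|x}}}(y|x) - k\sigma_{p_{low_{y|x}}}(y|x)\right) \ge 1 - \mathit{alpha}.$$

Here $p_{low_{y|x}}(y|x)$ is symmetric in $y$, $\mu_{p_{low_{y|x}}}(y|x) = med_{y|x}$, and by definition,
$r_l(x)$ denotes the (1 - alpha) lower confidence interval.

Thus, $r_l(x) > med_{y|x} - \left(\sigma_{p_{low_{y|x}}}(y|x)\right)\ldots$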
Figure 4.6 shows a plot with the 90% confidence intervals developed in Section 4.1.4 and
the upper and lower bounds for the upper and lower 90% confidence intervals as
developed in this section.

The second relation of confidence intervals to Chebyshev's inequality does not require
the Bayesian progression that is the focus of the research presented here, but it results in
extremely wide (and uninformative) confidence bounds. This relation is established as
follows.

For a given set of target samples $d$, a given set of non-target samples $h$, and a selected
alpha (such as alpha = 0.1), find a target sample standard deviation $\sigma_t$ and a non-target
sample standard deviation $\sigma_n$, and find the upper and lower bounds on the target mean as
follows. From Chebyshev's inequality (see [Hogg and Craig, 1978, pp. 59]),

$$P\left(|mean(d) - x_t| < k\sigma_t\right) \ge 1 - \frac{1}{k^2} = (1 - \mathit{alpha}). \tag{4.17}$$

Find the two values of $x_t$ such that

$$|mean(d) - x_t| < k\sigma_t. \tag{4.18}$$


Similarly, find the upper and lower bounds on the non-target mean by solving for $x_n$,
where

$$P\left(|mean(h) - x_n| < k\sigma_n\right) \ge 1 - \frac{1}{k^2} = (1 - \mathit{alpha}), \tag{4.19}$$

and find the two values of $x_n$ such that
Figure 4.6 Upper and lower bounds on 90% confidence intervals plus ROC curves and
coverage for a selected density pair. Here beta target and non-target densities
generate 30 target and 30 non-target samples (the densities have μ = 0.805,
σ = 0.059 and μ = 0.715, σ = 0.046, respectively). The 90% confidence
intervals for the ROC curve developed using the method described in Section
4.1.4 are the short dashed curves. The underlying true ROC curve is the
solid curve, the median ROC curve estimate is the dash-dotted curve, and
the upper and lower bounds of the 90% confidence intervals are the heavy
dashed curves.
$$|mean(h) - x_n| < k\sigma_n. \tag{4.20}$$


This  approach  results  in  ROC  curve  uncertainty  estimates  that  are  extremely  wide  and 
uninformative,  even  when  the  target  and  non-target  standard  deviations  are  specified.  If 
uncertainty  in  the  target  and  non-target  standard  deviations  is  incorporated,  these  bounds 
will  only  become  wider  and  less  informative.  Figure  4.6  provides  an  example  for  30 
target  and  30  non-target  samples.  Here  it  is  assumed  that  the  standard  deviation  is 
constant  at  the  standard  deviation  of  the  target  and  non-target  samples,  and  a  target  and 
non-target  beta  density  model  is  assumed  (both  of  these  selections  can  only  narrow  the 
bands  compared  with  more  general  cases).  In  combinations  where  the  mean  and 
standard  deviation  pairs  are  outside  of  the  admissible  set  (of  allowable  means  and 
standard  deviations  for  a  beta  density),  the  standard  deviation  is  retained,  but  the  mean  is 
adjusted  (brought  closer  to  the  sample  mean)  so  that  the  resulting  mean  and  standard 
deviation  are  within  the  admissible  set.  This  adjustment  can  only  make  the  calculated 
bounds  more  narrow. 
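
The bounds of Equations (4.17) through (4.20) can be sketched numerically. The following is a minimal illustration, not the procedure used to produce the figures: the function name `chebyshev_mean_bounds` and the six example scores are hypothetical, and the sample standard deviation is treated as fixed, as in the text.

```python
import math
import statistics

def chebyshev_mean_bounds(samples, alpha=0.1):
    """Return the two values x with |mean(samples) - x| = k * s.

    k is chosen so that 1 - 1/k**2 = 1 - alpha, i.e. k = 1/sqrt(alpha).
    s is the sample standard deviation, held fixed (an assumption).
    """
    k = 1.0 / math.sqrt(alpha)
    center = statistics.mean(samples)
    s = statistics.stdev(samples)
    return center - k * s, center + k * s

# Hypothetical target scores, for illustration only.
scores = [0.78, 0.81, 0.84, 0.76, 0.83, 0.80]
lower, upper = chebyshev_mean_bounds(scores, alpha=0.1)
print(lower, upper)
```

For alpha = 0.1 the multiplier is k = 1/sqrt(0.1), roughly 3.16 standard deviations on each side of the sample mean, which is one way to see why the resulting ROC curve bounds are so wide. For scores normalized to the interval from zero to one, the returned values may fall outside that interval.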

Finally, a third relation of confidence intervals to Chebyshev's inequality solves for
$m_{lower}$ and $m_{upper}$ such that (for 90% confidence bounds)

$$CI_{lower}(m_{lower}; x, d, h) = \int_0^{m_{lower}} p_{y|(x,d,h)}(y|x, d, h)\,dy = 0.05 \tag{4.21}$$

and

$$CI_{upper}(m_{upper}; x, d, h) = \int_{m_{upper}}^{1} p_{y|(x,d,h)}(y|x, d, h)\,dy = 0.05, \tag{4.22}$$

where $m_{lower}$ is the correct detection probability $y$ that produces a 5% lower confidence
interval at a specified false alarm probability $x$, for a set of target samples $d$ and a set of
non-target samples $h$, and $m_{upper}$ is the correct detection probability $y$ that produces a 5%
Figure 4.7 ROC curve uncertainty example with Chebyshev's inequality. ROC curve
estimates are produced from the underlying target and non-target densities
of Figure 4.6. Equations 4.17 through 4.20 are applied to find the 90%
bounds on uncertainty of the target and non-target means. The standard
deviation of the target and non-target samples is used, and target and
non-target densities at the extremes of the uncertainty bounds are combined to
form the curves shown in the top plot. The upper and lower limits of these
curves form confidence bounds; these bounds are extremely wide (the upper
ROC curve has an AUC value ≈ 1, and the lower ROC curve has an AUC
value ≈ 0). The four lower plots show two of the four sets of density pairs at
the uncertainty bound extremes. In the bottom right plots, the ROC curves
that correspond with the underlying target and non-target densities are shown
as solid curves, and the curves that correspond with the densities at left are
shown as dotted curves.

upper confidence interval at a specified false alarm probability $x$, for a set of target
samples $d$ and a set of non-target samples $h$.
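
As a rough illustration of Equations (4.21) and (4.22), the sketch below finds the two limits from a density tabulated on a grid of correct detection probabilities at one false alarm probability. It assumes such a tabulated density is already available; the function name `interval_from_density` and the flat example density are hypothetical stand-ins for the posterior density of the text.

```python
def interval_from_density(y_grid, density, tail=0.05):
    """Return (m_lower, m_upper) with `tail` probability below and above.

    y_grid  : increasing grid of correct detection probabilities
    density : density values on that grid (need not be normalized)
    """
    # Cumulative trapezoidal integral of the density.
    cdf = [0.0]
    for i in range(1, len(y_grid)):
        step = 0.5 * (density[i] + density[i - 1]) * (y_grid[i] - y_grid[i - 1])
        cdf.append(cdf[-1] + step)
    total = cdf[-1]
    cdf = [c / total for c in cdf]

    def quantile(q):
        # Linear interpolation inside the first cell whose CDF reaches q.
        for i in range(1, len(cdf)):
            if cdf[i] >= q:
                span = cdf[i] - cdf[i - 1]
                frac = 0.0 if span == 0.0 else (q - cdf[i - 1]) / span
                return y_grid[i - 1] + frac * (y_grid[i] - y_grid[i - 1])
        return y_grid[-1]

    return quantile(tail), quantile(1.0 - tail)

# Flat density on [0, 1], used only to check the arithmetic.
n = 1001
y = [i / (n - 1) for i in range(n)]
m_lower, m_upper = interval_from_density(y, [1.0] * n)
print(m_lower, m_upper)
```

For the flat density the limits sit at 0.05 and 0.95, as expected for a 90% interval with 5% in each tail.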

From Chebyshev's inequality (see [Hogg and Craig, 1978, pp. 58])

$$P\left[m_{lower}(x; d, h) \ge c_{lower}(x)\right] \le \frac{E[m_{lower}(x; d, h)]}{c_{lower}(x)} \tag{4.23}$$

and

$$P\left[m_{upper}(x; d, h) \ge c_{upper}(x)\right] \le \frac{E[m_{upper}(x; d, h)]}{c_{upper}(x)}, \tag{4.24}$$

where $c_{lower}(x)$ and $c_{upper}(x)$ are lower limits to the lower 90% confidence interval and
upper limits to the upper 90% confidence interval. This progression requires the
calculation, based on one set of target and non-target samples, of the expected value of
$m_{lower}(x; d, h)$ and $m_{upper}(x; d, h)$. Based on one set of target and non-target samples,
the best estimate is $E[m_{lower}(x; d, h)] = m_{lower}(x; d, h)$, and
$E[m_{upper}(x; d, h)] = m_{upper}(x; d, h)$. If more sets of samples are available, then these
new samples may be incorporated into the framework, and improved confidence intervals
may be developed. However, $p_{y|(x,d,h)}(y|x, d, h)$ is already the defined (actual) posterior
probability density for the ROC curve that fully incorporates what is known from the
observed target and non-target samples (which are assumed independent and identically
distributed), assumed model, and assumed priors. Thus, this discussion indicates that the
target and non-target samples $d$ and $h$ are realizations of random variables, and as such
the developed posterior probability density $p_{y|(x,d,h)}(y|x, d, h)$ may be (and should be)
updated if additional sets of representative target and non-target samples are available. In
any case, the developed posterior probability densities (and the corresponding confidence
intervals $CI_{lower}(m_{lower}; x, d, h)$ and $CI_{upper}(m_{upper}; x, d, h)$) are actual confidence
intervals based on the available samples, assumed model, and assumed priors.
The above discussion indicates that the posterior probability density is a full summary of
what is known about the ROC curve based on the observed sample data, the assumed
model, and the assumed priors. Carlin writes [Carlin and Louis, 2000, pp. 36] that a
Bayesian approach "enables direct probability statements about the likelihood of θ falling
in [set] C, i.e., 'The probability that θ lies in [set] C given the observed data y is at least
(1 - α).' This is in stark contrast to the usual frequentist CI, for which the corresponding
statement would be something like, 'If we could recompute [set] C for a large number of
datasets collected in the same way as ours, about (1 - α) × 100% of them would contain
the true value of θ.' This is not a very comforting statement, since we may not be able to
even imagine repeating our experiment a large number of times" (the use of [set], in
brackets, has been inserted here for clarity). This discussion by Carlin is applicable to
the research presented here if the actual ROC curve is denoted as θ, if C is the set of all
real values such that $m_{lower} < C < m_{upper}$, if y refers to the observed target and
non-target samples, and if α = 0.1 (for 90% confidence intervals). MacKay [MacKay,
2003, pp. 50] summarizes the value of the posterior probability distribution strongly in
the following statement: "The posterior probability distribution represents the unique and
complete solution to the problem. There is no need to invent 'estimators'; nor do we
need to invent criteria for comparing alternative estimators with each other."

4.1.6  Convergence  as  number  of  parameter  points  increases.  Wide  spacing  between 
the  prior  beta  density  mean  and  standard  deviation  points  for  target  densities  and/or 
non-target  densities  can  affect  the  size  of  the  confidence  band.  As  this  spacing 
approaches  zero  and  as  the  number  of  points  selected  therefore  approaches  infinity,  the 
confidence  band  area  converges  to  a  constant  (the  convergence  of  ROC  curve  density  is 
proven  in  Chapter  3;  the  confidence  intervals  are  then  deterministic  from  this  density).  A 
simple  example  of  this  process  is  shown  in  Figure  4.8.  Both  plots  have  as  inputs  the  same 
30  target  samples  and  the  same  30  non-target  samples.  The  plot  at  the  top,  labeled  coarse 
spacing, develops confidence interval contours using the nine highest-weighted points
Figure 4.8 The ROC curve confidence interval bands versus spacing of prior beta
density mean and standard deviation values. As spacing decreases and the
corresponding number of mean and standard deviation values considered
therefore increases, the confidence band area converges to a limit. An example
of this trend for a 95% confidence interval with false alarm probability as the
independent variable is shown here. Both plots use as inputs the same 30
target samples and the same 30 non-target samples. The plot labeled coarse
spacing develops confidence intervals using the nine highest-weighted points
uniformly spaced on the mean and standard deviation target and non-target
beta density axes such that the ratio of the weight of the lowest to the highest
points is 0.001. The plot labeled fine spacing develops confidence
intervals using the 25 highest-weighted points uniformly spaced on the mean and
standard deviation target and non-target beta density axes such that the ratio
of the weight of the lowest to the highest points is 0.001.
uniformly spaced in target density mean and standard deviation such that the ratio of the
posterior  density  (or  weight)  of  the  lowest  to  the  highest  is  0.001.  These  contours  define 
a  confidence  band.  Nine  highest-weighted  points  are  similarly  found  for  the  non-target 
density.  Note  that  if  only  one  point  for  target  density  and  one  point  for  non-target  density 
is  used,  the  confidence  band  area  is  0  because  the  ROC  curve  is  deterministic.  The  plot 
at  bottom,  labeled  fine  spacing,  develops  a  similar  confidence  band. 

Figure  4.9  shows  confidence  band  area  convergence  as  the  number  of  evaluated  points 
increases.  For  the  example  in  Figure  4.9,  target  standard  deviation  versus  mean  grid 
points  are  selected,  where  these  points  are  centered  around  the  mean  and  standard 
deviation  of  the  target  samples.  The  number  of  target  parameter  density  points  is 
increased  from  9  points  (3  target  means  and  3  target  standard  deviations)  to  25  points  (5 
target  means  and  5  target  standard  deviations),  etc.,  up  to  a  total  of  1089  points  (33  target 
means  and  33  target  standard  deviations).  Each  set  of  points  is  used  to  calculate 
confidence  bands.  The  confidence  band  area  converges  (the  convergence  of  ROC  curve 
density  is  proven  in  Section  3.2;  the  confidence  intervals  are  then  deterministic  from  the 
density)  as  the  number  of  parameter  points  increases,  which  indicates  that  point  spacing 
does  not  bias  the  prior  parameter  densities  if  the  points  are  selected  uniformly  over  the 
target  and  non-target  density  parameters  (such  as  mean  and  standard  deviation). 
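
The parameter grids discussed above can be sketched as follows. This is a simplified illustration under stated assumptions: a uniform prior over the grid, a beta likelihood parameterized by mean and standard deviation, and hypothetical names (`beta_params`, `grid_weights`, `linspace`) and sample values. It does not reproduce the rule that selects points by a lowest-to-highest weight ratio of 0.001.

```python
import math

def beta_params(mean, std):
    """Convert (mean, std) to beta shape parameters (a, b).

    Returns None when the pair is outside the admissible set,
    i.e. when the variance is not below mean * (1 - mean).
    """
    var = std * std
    if not (0.0 < mean < 1.0) or var <= 0.0 or var >= mean * (1.0 - mean):
        return None
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def beta_logpdf(x, a, b):
    """Log of the beta density at x, for 0 < x < 1."""
    return ((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def grid_weights(samples, means, stds):
    """Normalized posterior weights on a (mean, std) grid, uniform prior."""
    logw = {}
    for m in means:
        for s in stds:
            ab = beta_params(m, s)
            if ab is None:
                continue  # skip inadmissible (mean, std) pairs
            logw[(m, s)] = sum(beta_logpdf(x, *ab) for x in samples)
    top = max(logw.values())
    w = {key: math.exp(value - top) for key, value in logw.items()}
    total = sum(w.values())
    return {key: value / total for key, value in w.items()}

def linspace(lo, hi, n):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

# Hypothetical target scores, for illustration only.
samples = [0.78, 0.81, 0.84, 0.76, 0.83, 0.80, 0.86, 0.79]
coarse = grid_weights(samples, linspace(0.75, 0.87, 3), linspace(0.02, 0.08, 3))
fine = grid_weights(samples, linspace(0.75, 0.87, 5), linspace(0.02, 0.08, 5))
print(len(coarse), len(fine))
```

Refining the grid from 3 by 3 to 5 by 5 points, and onward, corresponds to the progression from 9 to 25 to 1089 points described above; the confidence band would then be recomputed from each weighted set of points.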

4.1.7  Additional  confidence  bound  definitions.  Note  that  the  method  developed  here 
extends  to  an  additional  class  of  confidence  bounds  that  are  not  described  elsewhere. 
These  confidence  bounds  describe  ROC  curves  for  a  threshold  selected  at  random,  with 
uniform  probability  of  selection  over  allowable  thresholds,  where  the  bounds  are  formed 
such that the integral of the ROC curve density above a specified value has the given
percentage of unit density volume. Such bounds are an extension of the method
developed here.
Figure 4.9 Confidence band area versus number of evaluated points. Here a beta score
target density with mean of 0.805 and standard deviation of 0.059 and a beta
non-target score density with mean of 0.715 and standard deviation of 0.046
generate 300 target and 300 non-target samples. The method of Section
4.1.3 estimates the ROC curve confidence band. The non-target posterior
parameter density is evaluated at a single point. The target density is
modeled by 3 means and 3 standard deviations (9 points), 5 means and 5
standard deviations, etc., where the mean and standard deviations of the selected
points for the 3 mean and 3 standard deviation, 5 mean and 5 standard
deviation, and 33 mean and 33 standard deviation cases are shown in the upper
two plots. As the number of target parameter points increases, the lower
plot shows that the confidence band area approaches a constant.
Let $p_{y,x}(y, x)$ be the joint density of the ROC curve (rather than the ROC curve density
normalized such that the probability density of correct detection at given false alarm
probability is one), as described by Equation (3.17), except here replace $p_{y|x}(y|x)$ with
$p_{y,x}(y, x)$. Let c.c. be the desired coverage (e.g., 0.90). Let the ROC subset $\Theta(z_1)$ be
the subset of all $x$, $y$ pairs such that

$$\Theta(z_1) = \left\{ (x, y) \in [0, 1]^2 : p_{y,x}(y, x) \ge z_1 \max_{x,y}\left[p_{y,x}(y, x)\right] \right\}, \tag{4.25}$$

where $x \in [0, 1]$, $y \in [0, 1]$, and $z_1 \in [0, 1]$. Let $z_1 = 1$ and find

$$c.c._{test} = \iint_{\Theta(z_1)} p_{y,x}(y, x)\,dx\,dy. \tag{4.26}$$

Then let $z_{1new} = z_{1old} - \epsilon$ if $c.c._{test} < c.c.$ Re-define the ROC subset $\Theta(z_{1new})$ for this
$z_{1new}$. Repeat the process, continuing to reduce $z_1$ until $c.c._{test} = c.c.$ The subset of all
$x$, $y$ pairs that make up $\Theta(z_1)$ where $c.c._{test} = c.c.$ forms the confidence bound.
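
The iteration of Equations (4.25) and (4.26) can be sketched on a grid. This is a minimal illustration under assumptions: the joint density is tabulated on a uniform grid of cells, the decrement (called `step` here) plays the role of the small reduction in z1, and the loop stops at the first z1 whose enclosed mass reaches the desired coverage, since exact equality is generally not attainable on a grid. The function name and the example density 4xy are hypothetical.

```python
def highest_density_subset(density, cell_area, coverage=0.90, step=0.01):
    """Lower z1 from 1 until the cells with density >= z1 * peak
    enclose at least `coverage` of the total probability mass.

    density   : 2-D list of joint density values, one per grid cell
    cell_area : area of one grid cell
    Returns (z1, members, mass); members is a list of (i, j) cell indices.
    """
    peak = max(max(row) for row in density)
    z1 = 1.0
    while True:
        members = [(i, j) for i, row in enumerate(density)
                   for j, p in enumerate(row) if p >= z1 * peak]
        mass = sum(density[i][j] for i, j in members) * cell_area
        if mass >= coverage or z1 <= 0.0:
            return z1, members, mass
        z1 = max(0.0, z1 - step)

# Illustrative joint density 4*x*y on the unit square (integrates to one),
# evaluated at cell midpoints of a 50 x 50 grid.
n = 50
mid = [(i + 0.5) / n for i in range(n)]
density = [[4.0 * x * y for y in mid] for x in mid]
z1, members, mass = highest_density_subset(density, cell_area=1.0 / (n * n))
print(z1, len(members), mass)
```

A smaller `step` brings the enclosed mass closer to the requested coverage at the cost of more iterations.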

Figure 4.10 shows ROC confidence bounds based on this definition and indicates higher
densities  close  to  the  ROC  extremes.  This  result  is  appropriate  because  any  ROC  curve 
has  a  correct  detection  probability  of  zero  at  false  alarm  probability  of  zero  and  a  correct 
detection  probability  of  one  at  false  alarm  probability  of  one. 


4.2  Verification  of  results 

4.2.1 Analysis of ROC curve and AUC value bias. The results that follow quantify the
confidence band accuracy for the method described here (in Section 4.1.3) by considering
repeated runs over many sets of samples. Before examining this accuracy, consider that
ROC curves and AUC values formed by fitting beta densities to beta density generated
score samples generally have low bias, even for low numbers of samples. For example
(see Figure 4.11), select a target and non-target beta density pair. Generate 30 target and
Figure 4.10 The ROC curve uniform threshold confidence bounds. The four plots show
30%, 50%, 70%, and 90% ROC curve bands formed such that the integral
of the ROC curve density above a specified value has the given percentage
of unit density volume, assuming that score threshold is randomly and
uniformly selected over all allowed threshold values (0 to 1). Note that only
the 2-D area showing the region bounded by this 3-D density is shown
in the above plot.
Figure 4.11 Estimates of ROC curves and AUC values from mean and variance of target
and non-target beta densities. The top two plots show the underlying beta
target densities (solid curves) and the underlying beta non-target densities
(dashed curves); the respective mean and standard deviation parameters are
0.599, 0.021, and 0.479, 0.023. The middle left plot shows the ROC curve
for the underlying beta densities (solid curve) with ROC curve statistics for
300 sets of 30 target and 30 non-target samples drawn from each density,
where the mean of the 300 curves (dash/dotted line) and this mean plus
and minus the standard deviations are plotted (dotted lines). The lower left
plot similarly shows the true AUC value, mean AUC value, and mean AUC
value plus and minus the standard deviation for 300 sets of 3, 10, 30, 50,
100, 200, and 500 target and non-target samples. The middle right and
lower right plots show similar results for the densities shown in the
upper right plot, for which the target and non-target densities have respective
mean and standard deviation parameters of 0.393, 0.134, and 0.381, 0.118.
30 non-target samples from each density. Fit beta densities to the target samples and
non-target  samples.  Form  a  ROC  curve  from  these  target  and  non-target  density 
estimates.  Repeat  this  process  many  times  for  many  different  sets  of  30  target  samples 
and  30  non-target  samples.  The  mean  of  the  ROC  curves  generated  approximates  the 
ROC  curve  of  the  underlying  densities.  Similarly,  the  mean  of  the  AUC  values  generated 
from  such  a  process  approximates  the  AUC  value  of  the  underlying  densities. 
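
The repeated-sampling check described above can be sketched as follows. Several choices here are assumptions rather than details given in the text: the beta densities are fitted by matching the sample mean and standard deviation, the AUC value is computed by a midpoint-rule integral of P(target score > non-target score), the density pair is the one quoted for Figure 4.6, and all function names are hypothetical.

```python
import math
import random
import statistics

def beta_params(mean, std):
    """Moment-matched beta shape parameters (a, b) for a given mean and std."""
    var = std * std
    if var >= mean * (1.0 - mean):
        raise ValueError("mean/std pair is outside the admissible set")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def beta_pdf(x, a, b):
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
                    + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def auc_from_betas(target_ab, nontarget_ab, n=400):
    """AUC = P(target score > non-target score), midpoint rule on n cells."""
    dx = 1.0 / n
    auc, cdf_nontarget = 0.0, 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        f_t = beta_pdf(x, *target_ab)
        f_n = beta_pdf(x, *nontarget_ab)
        # Non-target mass below the midpoint of this cell.
        auc += f_t * dx * (cdf_nontarget + 0.5 * f_n * dx)
        cdf_nontarget += f_n * dx
    return auc

random.seed(0)
TARGET = beta_params(0.805, 0.059)
NONTARGET = beta_params(0.715, 0.046)
true_auc = auc_from_betas(TARGET, NONTARGET)

def one_run(n_samples=30):
    """Draw samples, fit beta densities by moments, return the fitted AUC."""
    t = [random.betavariate(*TARGET) for _ in range(n_samples)]
    nt = [random.betavariate(*NONTARGET) for _ in range(n_samples)]
    fit_t = beta_params(statistics.mean(t), statistics.stdev(t))
    fit_n = beta_params(statistics.mean(nt), statistics.stdev(nt))
    return auc_from_betas(fit_t, fit_n)

aucs = [one_run() for _ in range(200)]
mean_auc = statistics.mean(aucs)
print(true_auc, mean_auc, statistics.stdev(aucs))
```

Comparing `mean_auc` with `true_auc` gives a rough view of the bias for sets of 30 samples; the spread of `aucs` indicates how much a single set can deviate.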

Figure  4.12  illustrates  results  of  a  process  that  characterizes  the  accuracy  of  AUC  values; 
this  process  is  of  interest  for  characterizing  RSD  values.  First,  assume  a  non-target 
density.  Then,  for  each  target  density,  find  the  corresponding  AUC  value.  For  the  fixed 
non-target density, the relation of AUC value to the mean and standard deviation of the
target density is shown in Figure 4.12. The method developed here is still
appropriate  in  the  presence  of  ROC  curve  or  AUC  value  bias  (an  analysis  of  CEG  curve 
and  RSD  value  bias,  also  included  in  this  section,  provides  further  discussion). 

4.2.2  The  ROC  curve  confidence  bounds.  The  explanation  here  largely  focuses  on 
confidence  intervals  at  selected  false  alarm  probabilities,  but  it  extends  to  confidence 
intervals  over  the  entire  ROC  curve,  which  form  confidence  contours,  and  to  the 
confidence  band  enclosed  by  the  contours.  Ideal  performance  metric  confidence 
intervals  may  achieve  two  objectives.  First,  the  stated  coverage  accuracy  of  the 
confidence  intervals  should  be  consistent  with  the  actual  coverage,  where  coverage 
accuracy  summarizes  actual  containment;  for  example,  90%  confidence  intervals  ideally 
contain  truth  with  90%  probability.  Second,  the  confidence  interval  widths  should  be  as 
small  as  possible. 

The  following  steps  evaluate  confidence  interval  accuracy  over  a  large  number  of  runs. 

1. Select a target and a non-target density and find the true score-threshold ROC curve
associated  with  these  densities.  The  true  ROC  curve  is  found  by  evaluating  the  function 
Figure 4.12 Comparison of AUC values for a fixed non-target score density. Here the
non-target score density is fixed at μ = 0.599 and σ = 0.021. The plots
show the effect of varying the target density parameters (μ and σ) for the
fixed non-target density parameters. The top and bottom plots are the same
except for orientation; two plots are provided to facilitate comparison with
the RSD value plots of Figure 4.22.
that generates the ROC curve (by varying the score-threshold t) as described in Equations
(2.5)  to  (2.10). 

2.  Generate  many  sets  of  target  and  non-target  score  samples  from  these  densities,  where 
each  set  of  samples  has  the  same  number  of  target  and  non-target  samples. 

3. Generate for each set confidence intervals for the ROC curve at each of a set of
uniformly spaced false alarm probabilities.

4.  Record  the  fraction  of  instances,  called  alpha,  where  the  truth  (i.e.,  the  true  ROC 
curve)  is  outside  of  the  confidence  intervals;  for  90%  confidence  intervals  this  fraction  is 
ideally  0.10. 

5.  Generate  a  summary  alpha  value  for  the  entire  confidence  band  by  finding  the 
percentage  of  correct  detection  probabilities  where  the  confidence  intervals  do  not 
contain  truth  for  all  false  alarm  probabilities  and  for  all  sets. 
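
Steps 4 and 5 above amount to counting misses in a table of confidence intervals. The sketch below assumes the intervals from step 3 are already available as lower and upper limits per run and per false alarm probability; the function name `observed_alphas` and the small example arrays are hypothetical.

```python
def observed_alphas(truth, lower, upper):
    """Fraction of confidence intervals that do not contain truth.

    truth[j]    : true correct detection probability at false alarm point j
    lower[i][j] : lower confidence limit for run i at false alarm point j
    upper[i][j] : upper confidence limit for run i at false alarm point j
    Returns (alpha per run, alpha per false alarm point, summary alpha).
    """
    n_runs, n_fa = len(lower), len(truth)
    miss = [[not (lower[i][j] <= truth[j] <= upper[i][j]) for j in range(n_fa)]
            for i in range(n_runs)]
    per_run = [sum(row) / n_fa for row in miss]
    per_fa = [sum(miss[i][j] for i in range(n_runs)) / n_runs
              for j in range(n_fa)]
    summary = sum(sum(row) for row in miss) / (n_runs * n_fa)
    return per_run, per_fa, summary

# Two runs, three false alarm probabilities (made-up numbers).
truth = [0.2, 0.5, 0.8]
lower = [[0.10, 0.40, 0.70], [0.25, 0.40, 0.70]]
upper = [[0.30, 0.60, 0.90], [0.35, 0.60, 0.75]]
per_run, per_fa, summary = observed_alphas(truth, lower, upper)
print(per_run, per_fa, summary)
```

In this example the first run contains truth everywhere and the second run misses at two of three points, so the summary alpha is 2/6. For 90% confidence intervals evaluated over many runs, the summary value would ideally be near 0.10.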

The  Bayesian  framework  developed  here  actually  produces  confidence  intervals  that 
reflect  coverage  probability  for  particular  runs  (for  the  samples,  assumed  model,  and 
assumed  priors);  other  approaches  focus  on  confidence  interval  accuracy  only  over  a 
large  number  of  runs.  Note  that  the  steps  above  are  not  in  themselves  concerned  with 
performance  for  a  particular  run,  and  thus  these  steps  perform  a  frequentist-type 
verification  that  evaluates  "on  average"  performance  over  many  runs  (or  sets  of  target  and 
non-target  samples)  (see  [Carlin  and  Louis,  2000,  pp.  35-36]).  However,  it  is  of  interest 
to  test  the  performance  of  the  Bayesian  approach  over  a  large  number  of  runs  (as  the 
confidence  interval  results  over  one  run,  although  correct,  are  not  possible  to  verify 
numerically,  except  over  many  runs). 

The lower left plot of Figure 4.13 shows that the observed alpha for a particular run can
range  from  0  to  1.  The  summary  alpha  value  over  all  runs  for  the  example  shown  in 
Figure  4.13  is  0.09,  which  approximates  the  ideal  alpha  of  0.10. 
Figure 4.13 Densities, ROC curves, alphas, and coverage for a selected density pair.
Here the beta densities of the upper left plot generate 30 target and 30
non-target samples (the densities have μ = 0.805, σ = 0.059, and μ = 0.715,
σ = 0.046, respectively). The confidence intervals for the ROC curve are
shown at the lower left. The upper right plot shows the observed alphas
for 200 sets of 30 target and 30 non-target samples, where the mean over
many runs should approach 0.10; the observed mean alpha is 0.092. The
lower right plot investigates possible bias; results show the process to be
unbiased, where vertical lines are 90% confidence bars; these bars narrow
as the number of sets increases.
Note that if the summary alpha value described above results in an ideal alpha, the
confidence  intervals  of  correct  detection  probability  at  a  particular  false  alarm  probability 
are  not  necessarily  ideal.  Thus,  it  is  of  interest  to  evaluate  the  fraction  of  sets  or  runs 
where  the  separate  confidence  intervals  at  particular  false  alarm  probabilities  enclose 
truth.  The  lower  right  plot  of  Figure  4.13  provides  an  example,  where  the  straight 
horizontal  line  indicates  ideal  90%  coverage  and  the  vertical  error  bars  describe 
uncertainty  due  to  the  finite  number  of  test  runs  (as  the  number  of  test  runs  increases,  the 
length  of  each  vertical  error  bar  decreases).  The  coverage  of  each  run  is  assumed  to  be 
from  a  binomial  density;  the  figure  shows  90%  vertical  error  bars  based  on  this 
assumption.  The  process  described  above  for  developing  confidence  intervals  is  optimal 
for  the  assumed  models,  the  assumed  priors,  and  the  given  input  samples.  Thus  any 
deviation  in  the  coverage  accuracy  of  confidence  intervals  is  due  to  inapplicable  model 
density  forms  or  inapplicable  prior  densities  of  model  parameters.  Figure  4.14  provides 
an  example  for  different  underlying  target  and  non-target  densities. 
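
The binomial error bars mentioned above can be sketched as a central interval of a binomial distribution. The text does not state which success probability is used for the bars, so the sketch assumes the nominal coverage (0.90) and a central 90% interval; the function name `binomial_interval` is hypothetical.

```python
import math

def binomial_interval(n_runs, p, level=0.90):
    """Central `level` interval of Binomial(n_runs, p), as coverage fractions."""
    tail = (1.0 - level) / 2.0
    log_p, log_q = math.log(p), math.log(1.0 - p)
    cdf, low, high = 0.0, None, n_runs
    for k in range(n_runs + 1):
        # Log of the binomial probability mass at k, via log-gamma.
        log_pmf = (math.lgamma(n_runs + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_runs - k + 1)
                   + k * log_p + (n_runs - k) * log_q)
        cdf += math.exp(log_pmf)
        if low is None and cdf >= tail:
            low = k          # smallest count whose CDF reaches the lower tail
        if cdf >= 1.0 - tail:
            high = k         # smallest count whose CDF reaches the upper tail
            break
    return low / n_runs, high / n_runs

bar_200 = binomial_interval(200, 0.90)
bar_2000 = binomial_interval(2000, 0.90)
print(bar_200, bar_2000)
```

The interval for 2000 runs is narrower than the interval for 200 runs, which matches the statement that the vertical error bars shorten as the number of test runs increases.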

A similar process is used to develop coverage estimates for AUC value confidence
intervals, CEG curve confidence intervals, and RSD value confidence intervals. Figure
4.15 shows the ROC curve density and density contours that correspond with the
confidence intervals of Figure 4.14. Coverage estimates for an AUC value example are
shown in Figure 4.16. The upper plot shows the true ROC curve (solid line) and 90%
confidence intervals (dashed line) for a single run of an assumed density model. The
lower plot shows the AUC value estimate (solid curve) and AUC value 90% confidence
intervals (dotted curves) for many separate runs. The calculated alpha value is 0.0993,
which approximates the ideal alpha value of 0.10 for 90% confidence intervals.

Attempts  to  describe  coverage  accuracy  often  result  in  an  apparent  paradox.  For 
example,  assume  that  30  target  samples  and  30  non-target  samples  are  available.  Then 
form  ROC  curve  confidence  intervals  as  detailed  in  Section  3.4.  While  this  single  set  of 
samples  forms  confidence  intervals,  coverage  accuracy  estimation  requires  many  sets  of 
Figure 4.14 Densities, ROC curves, alphas, and coverage for a different target and
non-target density pair (these beta densities have μ = 0.65, σ = 0.062, and
μ = 0.745, σ = 0.043, respectively). This figure repeats the analysis of
Figure 4.13 for a different target and non-target density pair.
Figure 4.15 A ROC curve density and density contours. The ROC curve density and
density contours that correspond with the confidence intervals of Figure
4.14 are shown.

Figure 4.16 Estimates of ROC curves and AUC value confidence intervals. The upper
plot shows the true ROC curve (solid line), the median ROC curve
(dash-dotted line), and 90% confidence interval contours (dashed lines) for a
single run of the density model of the top left plot of Figure 4.13. The lower
plot shows the AUC value estimates (solid curve) and AUC value 90%
confidence intervals (dotted curves) for many separate runs sorted by lowest
to highest estimated AUC value. The straight horizontal line indicates the
true AUC value for an infinite number of samples. The calculated alpha
value is 0.0993, which approximates the ideal alpha value of 0.10.
target and non-target samples. However, if these sets of samples are available, they may
be  concatenated  so  that  the  number  of  target  samples  and  non-target  samples  is  much 
greater  than  30.  Thus,  if  enough  information  is  known  to  test  confidence  interval 
accuracy,  then  enough  information  is  known  to  make  the  confidence  intervals 
unnecessary.  This  discussion  identifies  a  need  for  representative  test  data.  For  such  test 
data  either  the  underlying  target  and  non-target  score  sample  densities  are  known,  or  a 
very  large  number  of  target  and  non-target  score  samples  are  known  (the  latter  is  the  case 
for  the  experimental  results  of  the  following  section). 

4.2.3  ROC  curve  experimental  data  results.  Chapter  3  develops  a  Bayesian 
framework  that  generates  performance  metric  densities.  From  this  framework,  various 
descriptive  statistics  are  derived.  The  framework  and  descriptive  statistics  have  in  large 
part  been  demonstrated  with  a  beta  density  model;  however,  they  apply  to  other  density 
models,  such  as  beta  mixture  models  or  Gaussian  models.  An  example  of  this  extension 
is  described  here  using  experimental  data  from  an  actual  SUT  rather  than  data  generated 
from  assumed  underlying  target  and  non-target  densities.  The  Air  Force  Research 
Laboratory (AFRL) made this data available by applying a mean-square/generalized
likelihood ratio test (MS/GLRT) algorithm to Moving and Stationary Target Acquisition
and  Recognition  (MSTAR)  public  data  (see  [Bryant,  2002]). 

Figure  4.17  shows  the  experimental  target  and  non-target  data,  after  normalization  to  zero 
to  one.  The  following  procedure  selects  a  full  set  of  target  and  non-target  samples, 
starting  with  588  target  scores.  The  AFRL  data  has  nine  sets.  Sets  1,  2,  and  3  pertain  to 
SAR  images  that  all  contain  a  BMP2  vehicle.  Set  1  is  the  collection  of  196  images  of  a 
selected  BMP2  vehicle.  Set  2  is  the  collection  of  196  images  of  a  second  BMP2  vehicle. 
Set 3 is the collection of 196 images of a third BMP2 vehicle. For each of the 588
images in these three sets, an MS/GLRT algorithm has been applied by Bryant to obtain
three values [Bryant, 2002]. The first value describes the match of the image to a BMP2,
the second value describes the match of the same image to a BTR70 (armored personnel
carrier), and the third value describes the match of the same image to a T-72 (tank).

Here, only the first value is of interest (because the BMP2 is assumed to be a target),
and thus 196 x 3 (588) target scores are obtained. Similarly, 784 non-target scores are
obtained as follows. Sets 4, 5, 6, and 7 each consist of 196 images that contain a selected
BTR70, T-72 (tank 1), T-72 (tank 2), and T-72 (tank 3), respectively. As in sets 1, 2, and
3, the MS/GLRT algorithm has been applied to each set to obtain three values (the match
of the image to the BMP2, the BTR70, and the T-72). Since the target is the BMP2, only the
first value among the three is retained. Thus there are now 196 x 4 (784) non-target scores.

In  addition  to  the  sets  of  3-dimensional  data  values,  AFRL  provided  code  that  assists  in 
the  above  process.  Note  that  there  are  many  options  for  obtaining  example  target  and 
non-target  samples  in  addition  to  the  method  described  above.  An  alternative  option 
takes  the  three  (BMP-2,  BTR-70,  and  T-72)  values  for  each  image  and  retains  the  highest 
among  the  three  real  number  values.  In  such  an  alternative,  an  SUT  achieves  success  as 
long  as  it  correctly  identifies  that  an  image  contained  a  weapon  system;  the  SUT  would 
not  necessarily  need  to  identify  the  specific  system. 
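
The selection procedure described above can be sketched as follows. This is a minimal illustration and not the AFRL-provided code: the data layout (one triple of BMP2, BTR70, and T-72 match values per image), the function names, and the tiny placeholder sets are all assumptions made for the example.

```python
# Hypothetical layout: each set is a list of (bmp2, btr70, t72) match values,
# one triple per image. The small sets below are made-up placeholders,
# not AFRL data.
def select_scores(sets, target_ids, nontarget_ids, column=0):
    """Keep one match value per image and split images into target/non-target."""
    targets = [triple[column] for i in target_ids for triple in sets[i]]
    nontargets = [triple[column] for i in nontarget_ids for triple in sets[i]]
    return targets, nontargets


def max_scores(image_set):
    """Alternative option: keep the highest of the three values per image."""
    return [max(triple) for triple in image_set]


toy_sets = {
    1: [(0.9, 0.4, 0.3), (0.8, 0.5, 0.2)],  # stands in for a BMP2 set
    4: [(0.5, 0.9, 0.4)],                    # stands in for the BTR70 set
    5: [(0.6, 0.3, 0.8)],                    # stands in for a T-72 set
}
# column=0 retains the BMP2 match value, since the BMP2 is the assumed target.
targets, nontargets = select_scores(toy_sets, target_ids=[1], nontarget_ids=[4, 5])
```

With nine sets of 196 images laid out this way, `target_ids=[1, 2, 3]` and `nontarget_ids=[4, 5, 6, 7]` would yield the 588 target scores and 784 non-target scores used in this section.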

Sets  8  and  9  are  not  weapon  systems  (for  example,  set  9  contains  only  bulldozers). 

Initial  normalization  ensures  that  all  values  within  the  nine  sets  of  data  range  from  zero  to 
one.  Since  many  of  these  values  are  not  used  when  BMP2  is  the  assigned  target,  the  588 
target  scores  and  784  non-target  scores  have  a  narrower  range  than  zero  to  one.  The 
lowest value among the 588 target scores and 784 non-target scores is approximately 0.4 and
the  highest  value  is  1.  Note  that  if  a  score  of  exactly  zero  or  exactly  one  is  tested  in  a 
beta  density  based  model,  the  posterior  density  equals  zero.  Therefore,  an  additional 
linear  transformation  is  applied  to  the  data  such  that  all  values  within  the  nine  sets  of  data 
have  an  upper  limit  of  0.95  and  a  lower  limit  of  0.05. 
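
The additional linear transformation can be illustrated with a short sketch. The function name and the example values are assumptions; the only facts taken from the text are the limits of 0.05 and 0.95.

```python
def rescale(values, lower=0.05, upper=0.95):
    """Linearly map values so that min(values) -> lower and max(values) -> upper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so no linear rescaling exists")
    scale = (upper - lower) / (hi - lo)
    return [lower + (v - lo) * scale for v in values]


scores = [0.0, 0.4, 0.7, 1.0]  # made-up values already normalized to [0, 1]
scaled = rescale(scores)        # no value is exactly 0 or 1 after this step
```

To match the description above, the minimum and maximum would be taken over all nine sets pooled together, so that the 588 target scores and 784 non-target scores share one transformation.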

Here  two  comparison  processes  estimate  the  ROC  curve  densities  and  generate  ROC 
curve  confidence  intervals.  The  first  process  applies  a  single  beta  density  model.  The 
second  process  applies  a  two-beta  mixture  density  model.  Note  that  in  the  two-beta 
density model, the number of target and non-target grid points required is large; an
exhaustive  iterative  search  over  uniform  means  and  uniform  standard  deviations  is  not 
used.  Instead,  grid  points  are  selected  in  a  uniform,  random  manner  over  all  allowable 
means,  standard  deviations,  and  ratios,  such  that  an  example  two-beta  density  is  fully 
defined  by  two  means,  two  standard  deviations,  and  one  ratio.  The  ratio  shows  the 
relative  weighting  of  the  two  beta  densities  that  comprise  the  two-beta  density  model. 
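
A two-beta density defined by two means, two standard deviations, and one ratio can be sketched as below. The conversion from mean and standard deviation to beta shape parameters is the standard moment relation; the sampling ranges, the rejection of invalid mean and standard deviation pairs, and all function names are assumptions for illustration and may differ from the ranges regarded as allowable in this research.

```python
import math
import random


def beta_params(mean, std):
    """Convert a mean and standard deviation to beta shape parameters (a, b)."""
    nu = mean * (1.0 - mean) / std**2 - 1.0
    if not 0.0 < mean < 1.0 or nu <= 0.0:
        raise ValueError("no beta density has this mean and standard deviation")
    return mean * nu, (1.0 - mean) * nu


def beta_pdf(x, a, b):
    """Beta density at 0 < x < 1, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def draw_mean_std(rng, max_std=0.5):
    """Draw a (mean, std) pair uniformly, rejecting pairs no beta density can have."""
    while True:
        m, s = rng.uniform(0.0, 1.0), rng.uniform(1e-3, max_std)
        if s**2 < m * (1.0 - m):
            return m, s


def random_grid_point(rng):
    """One grid point: (mean1, std1, mean2, std2, ratio)."""
    m1, s1 = draw_mean_std(rng)
    m2, s2 = draw_mean_std(rng)
    return m1, s1, m2, s2, rng.uniform(0.0, 1.0)


def mixture_pdf(x, point):
    """Two-beta density: ratio weights the first beta, (1 - ratio) the second."""
    m1, s1, m2, s2, ratio = point
    return (ratio * beta_pdf(x, *beta_params(m1, s1))
            + (1.0 - ratio) * beta_pdf(x, *beta_params(m2, s2)))


point = random_grid_point(random.Random(0))
```

Each random grid point would then be weighted by its posterior density given the observed samples, in the same way as the grid points of the single beta model.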

Figures  4.18  and  4.19  show  the  results.  Note  that  since  the  underlying  densities  are  not 
known,  the  experimental  data  coverage  accuracies  and  alphas  are  not  expected  to  be  as 
ideal  as  in  the  examples  of  previous  sections.  Figure  4.18  assumes  a  single  beta  model 
for  the  data.  Many  sets  of  30  target  samples  and  30  non-target  samples  are  drawn  from 
the  588  target  scores  and  784  non-target  scores,  and  the  assumed  truth  is  the  ROC  curve 
formed  by  all  1372  scores. 
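
The sampling experiment can be sketched as follows. The empirical ROC construction, the draw without replacement, and the placeholder score pools (generated here from arbitrary beta densities rather than taken from the AFRL data) are assumptions for illustration.

```python
import random


def empirical_roc(targets, nontargets, false_alarm_probs):
    """Empirical correct detection probability at each false alarm probability."""
    sorted_non = sorted(nontargets, reverse=True)
    n = len(sorted_non)
    roc = []
    for pfa in false_alarm_probs:
        k = int(pfa * n + 1e-9)  # non-target scores allowed above the threshold
        threshold = sorted_non[k] if k < n else float("-inf")
        # A score strictly above the threshold is declared a target.
        roc.append(sum(t > threshold for t in targets) / len(targets))
    return roc


rng = random.Random(0)
target_pool = [rng.betavariate(8, 3) for _ in range(588)]     # placeholder scores
nontarget_pool = [rng.betavariate(5, 4) for _ in range(784)]  # placeholder scores
pfa_grid = [i / 20 for i in range(21)]

# Assumed truth: the ROC curve formed by all 1372 scores.
assumed_truth = empirical_roc(target_pool, nontarget_pool, pfa_grid)

# One run: 30 target and 30 non-target scores drawn from the pools.
run_targets = rng.sample(target_pool, 30)
run_nontargets = rng.sample(nontarget_pool, 30)
run_roc = empirical_roc(run_targets, run_nontargets, pfa_grid)
```

In the experiment described here, each run of 30 target and 30 non-target scores feeds the Bayesian confidence interval process rather than the raw empirical curve; `run_roc` is shown only to make the sampling step concrete.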

The  figure  shows  confidence  intervals  developed  for  one  run  of  30  target  and  30 
non-target  samples  (drawn  from  the  588  target  scores  and  784  non-target  scores)  and 
coverage  accuracy  based  on  105  such  sets.  Note  that  the  ideal  mean  alpha  is  0.1,  and  the 
observed  alpha  is  0.2359.  Figure  4.19  applies  a  two-beta  mixture  model  to  the  same 
process.  The  two  beta  mixture  model  has  5  parameters  (two  means,  two  standard 
deviations,  and  a  ratio  of  the  two  beta  densities).  The  mean  alpha  for  this  bimodal 
two-beta density mixture model is 0.1038 ≈ 0.10, which improves on the single beta model
results.

The  lower  left  plots  of  both  Figure  4.18  and  4.19  show  confidence  intervals  developed  by 
the  single  beta  models  and  the  two-beta  mixture  models  for  the  same  set  of  30  target  and 
30  non-target  samples.  The  upper  left  plots  of  the  figures  use  the  same  set  of  target  and 
non-target  samples,  and  the  plots  show  the  target  and  non-target  densities  that  correspond 
to  the  ROC  curve  with  the  highest  posterior  density  or  weight  (see  Figure  3.8).  Even 
though  the  target  and  non-target  densities  of  the  highest  posterior  density  for  the  single 
beta  density  model  do  not  appear  to  be  of  the  same  form  as  Figure  4.17,  the  ROC  curve 

Figure 4.17 Experimental target and non-target score histograms. Based on a subset of
data from AFRL/SN [Bryant, 2002], the confidence interval development
process (see Figures 4.5 and 4.13) is applied to the experimental data shown
above. A single beta density model is applied to this data in Figure 4.18,
and a two-beta mixture density model is applied in Figure 4.19. Note that
a beta density model requires scaling of the data (since the data here must
range from 0 to 1).

Figure 4.18 Densities, ROC curves, alphas, and coverage for 30 target and 30 non-target
samples generated from the experimental data shown in Figure 4.17 and a
single beta model. The data of Figure 4.17 is scaled for a maximum range
of 0.05 to 0.95 rather than 0 to 1 (see text).

Figure 4.19 Densities, ROC curves, alphas, and coverage for 30 target and 30 non-target
samples generated from the experimental data shown in Figure 4.17 and a
two-beta mixture model.

confidence intervals appear reasonable. This difference emphasizes the benefit of a
Bayesian  approach.  Further,  the  two-beta  mixture  model  appears  to  have  better  coverage 
accuracy  over  all  false  alarm  probabilities,  showing  that  more  complex  models  may  be  of 
benefit when the density form is not known, such as in this experimental data (for
example,  note  the  small  but  significant  number  of  target  samples  between  0.5  and  0.7, 
and  the  small  but  significant  number  of  non-target  samples  between  0.4  and  0.6). 

Note  that  the  comparison  "truth"  is  actually  an  estimate  of  truth  as  it  consists  of  only  588 
target  scores  and  784  non-target  scores.  These  numbers  seem  large  enough  to 
approximate  truth,  but  there  is  uncertainty  (see  Figures  3.2  and  3.3,  and  related 
discussion).  This  result  also  emphasizes  the  importance  of  incorporating  knowledge  of 
the actual underlying model, if known. MacKay [MacKay, 2003] discusses the related
concept  of  importance  sampling,  which  provides  the  option  of  using  a  simpler  model 
even  when  it  is  known  that  a  more  complex  model  is  truth. 

Additional  implementation  choices  exist.  An  option  is  to  change  the  scaling  of  the  data. 
If  the  data  were  scaled  from  0.1  to  0.9  rather  than  0.05  to  0.95,  the  scaling  may  impact 
coverage  accuracy.  An  example  for  the  single  beta  density  case  for  0.1  to  0.9  scaling  is 
shown  in  Figure  4.20.  For  this  example,  the  change  in  scaling  has  minimal  impact  on  the 
results. 

The  results  presented  here  show  the  ability  of  the  framework  to  evaluate  experimental 
data. These results do not imply that the two-beta density mixture model
will always produce results that improve on a single beta model. The single beta density
framework  (and  two-beta  density  mixture  model  extension)  have  been  introduced  in  this 
research  as  examples  to  test  the  framework  developed  in  Chapter  3.  Detailed  approaches 
regarding  the  appropriate  incorporation  of  more  complex  models  are  presented  in  future 
work. 

Figure 4.20 Same as Figure 4.18, except that the experimental sample values are scaled
for a maximum range of 0.1 to 0.9.

4.2.4 Analysis of CEG curve and RSD value bias. The results that follow in Section
4.2.5 quantify CEG curve confidence band accuracy by repeated runs over many sets of
samples.  Before  examining  this  accuracy,  consider  that  RSD  values  formed  by  fitting 
beta  densities  to  beta  density  generated  data  can  have  a  higher  bias  than  AUC  values, 
particularly  for  low  numbers  of  samples.  For  example  (see  Figure  4.21),  select  a  target 
and  non-target  beta  density  pair.  Generate  30  target  and  30  non-target  samples  from  each 
density.  Fit  beta  densities  to  the  30  target  and  30  non-target  samples  by  matching  sample 
and  density  mean  and  variance.  Form  a  CEG  curve  and  RSD  value  from  these  two  beta 
density  estimates.  Repeat  this  process  many  times  for  many  different  sets  of  30  target 
samples  and  30  non-target  samples.  The  mean  RSD  value  generated  from  this  process 
may  be  consistent  with  the  RSD  value  of  the  underlying  densities.  Note  that  the  CEG 
curve  estimates  exhibit  a  slight  bias,  but  the  standard  deviation  is  wide. 
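
The moment-matching fit used in this procedure can be sketched as below. The underlying shape parameters, the use of the unbiased sample variance, and the function name are assumptions for illustration; the CEG curve and RSD value computations are defined elsewhere in this document and are not reproduced here.

```python
import random
import statistics


def fit_beta_by_moments(samples):
    """Fit beta shape parameters (a, b) by matching sample mean and variance."""
    m = statistics.fmean(samples)
    v = statistics.variance(samples)  # sample variance with n - 1 denominator
    nu = m * (1.0 - m) / v - 1.0
    if nu <= 0.0:
        raise ValueError("sample variance is too large for a beta density")
    return m * nu, (1.0 - m) * nu


rng = random.Random(1)
true_a, true_b = 13.8, 9.2  # illustrative underlying density (mean 0.6, std 0.1)

# Repeat the fit for many different sets of 30 samples.
fits = [
    fit_beta_by_moments([rng.betavariate(true_a, true_b) for _ in range(30)])
    for _ in range(300)
]
mean_fitted_a = statistics.fmean(a for a, _ in fits)
mean_fitted_b = statistics.fmean(b for _, b in fits)
```

Comparing the mean of a statistic over the 300 fits with its value for the underlying density is the bias check described above; any statistic computed from the fitted pair, such as the RSD value, can be substituted for the shape parameters.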

In  Figure  4.22  a  non-target  density  is  assumed,  then  the  RSD  value  is  found  for  many 
target  beta  densities.  If  truth  is  at  the  minimum  of  the  "bowl"  shown,  then  the  verification 
process  that  was  used  for  AUC  values  is  not  appropriate  for  RSD  values  (compare  Figure 
4.22  with  Figure  4.12).  However,  RSD  values  developed  here  are  appropriate:  given  an 
assumed  model  of  beta  densities  for  target  and  non-target  and  given  target  and  non-target 
samples,  90%  correct  confidence  intervals  for  RSD  values  can  be  generated.  These 
confidence intervals are correct, although they may not enclose the truth for 90% of runs.

The  verification  issue  noted  here  may  be  illustrated  as  follows.  Suppose  1000  students 
take  a  test  of  100  questions.  It  is  known  (as  a  prior)  that  999  of  the  students  answer  80 
questions  correctly  and  one  student  answers  95  questions  correctly.  An  evaluator  is 
aware  of  this  information  and  obtains  10  test  questions  from  a  randomly  selected  student. 
Unknown  to  the  evaluator,  the  selected  student  is  the  student  who  answers  95  questions 
correctly.  The  evaluator  is  to  provide  90%  confidence  intervals  for  the  number  of 
questions  that  the  student  answers  correctly.  Based  on  the  priors,  the  evaluator  specifies 
the  upper  and  lower  90%  confidence  intervals  at  80  questions  correct.  This  process  is 

Figure 4.21 Estimates of CEG curves and RSD values. The top two plots show
underlying target densities (solid curves) and underlying non-target densities
(dashed curves). The middle left plot shows the CEG curves for the
underlying beta densities (solid curve) with CEG curve statistics for 300 sets
of 30 target and 30 non-target samples drawn from each density shown in
the top left plot, where the mean of the 300 curves (dash/dotted line) and
this mean plus and minus the standard deviations are plotted (dotted lines).
The lower left plot similarly shows the true RSD value, mean RSD value,
and mean RSD value plus and minus the standard deviation for 300 sets of
3, 10, 30, 50, 100, 200, and 500 target and non-target samples. The middle
right and lower right plots show similar plots for the densities in the upper
right plot.

Figure 4.22 The RSD values for a fixed non-target density. Here the non-target density
is constant and the target density varies over the full range of possible beta
parameters. Note the bowl appearance, where RSD approaches a minimum
at a mean of 0.6 and standard deviation of 0.1. If the true density
has the minimum RSD value, then the RSD confidence intervals developed
for a small set of samples do not enclose truth because uniform priors over
mean and standard deviation are assumed. The confidence intervals are
reasonable even though they are not necessarily appropriate in the standard
coverage accuracy test used for the CEG curve, ROC curve, and AUC value
confidence intervals. The two plots are the same except for orientation.

repeated many times. No matter how many sets of 10 questions are provided, when each
set  is  considered  individually,  the  confidence  intervals  will  never  enclose  the  truth  of  "95 
questions  correct". 
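
The arithmetic behind this illustration can be checked with a short sketch. It assumes that the 10 questions are drawn at random without replacement from the 100 and that the evaluator observes how many of them the student answered correctly; the text does not specify the sampling, so the hypergeometric likelihood and the function name are assumptions.

```python
from math import comb


def posterior_high_scorer(num_correct_seen, sample_size=10, total=100):
    """Posterior probability that the selected student is the one with 95 correct."""
    prior = {80: 999 / 1000, 95: 1 / 1000}
    weights = {}
    for known_correct, p in prior.items():
        # Hypergeometric likelihood of seeing num_correct_seen correct answers
        # among sample_size questions drawn without replacement.
        likelihood = (
            comb(known_correct, num_correct_seen)
            * comb(total - known_correct, sample_size - num_correct_seen)
            / comb(total, sample_size)
        )
        weights[known_correct] = p * likelihood
    return weights[95] / sum(weights.values())


best_case = posterior_high_scorer(10)  # all 10 sampled answers are correct
```

Even in the most favorable case, where all 10 sampled answers are correct, the posterior probability of the 95-question student stays below 0.01 under these assumptions, so a 90% interval built from the prior and 10 questions remains at 80 questions correct.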

4.2.5 The CEG curve confidence bounds. Figure 4.23 is similar to Figure 4.13,
except  that  the  performance  metric  examined  is  the  CEG  curve  rather  than  the  ROC 
curve.  Using  the  accuracy  description  of  alpha,  as  with  the  ROC  curve,  CEG  curve 
confidence  interval  development  is  shown  to  be  accurate  for  the  assumed  model  and 
priors.  Note  that  this  figure  is  representative  of  CEG  curve  results;  similar  plots  with 
sample  sizes  of  10,  30,  100,  and  200  have  been  tested  with  similar  results  (and  with  an 
additional  underlying  density  for  which  the  CEG  curve  is  near  the  45  degree  line).  The 
results  are  significant  because  whereas  the  ROC  curve  confidence  interval  process 
described  here  is  an  improvement  over  existing  techniques,  the  CEG  curve  confidence 
interval  specification  process  is  without  precedent.  The  results  also  demonstrate  the 
general  extensibility  of  the  entire  Bayesian  framework  to  performance  metrics  other  than 
the  ROC  curve.  Figure  4.24  shows  an  additional  example  using  different  underlying 
target  and  non-target  densities. 

The  verification  processes  that  are  applied  here  in  Chapter  4  will  be  used  to  assist  in 
comparisons  with  the  literature  in  Chapter  5. 

Figure 4.23 The alpha metric for a CEG curve. Here the underlying densities shown at
the upper left generate 30 target and 30 non-target samples (the beta densities
have μ = 0.599, σ = 0.021, and μ = 0.479, σ = 0.023, respectively).
Confidence intervals for the corresponding CEG curve are shown in the
lower left with the median CEG curve and the true CEG curve. The upper
right plot shows the observed alphas for 264 sets of 30 target and 30
non-target samples, where the mean over many runs should approach 0.10; the
observed mean alpha is 0.1035. The lower right plot investigates possible
bias; results show the process to be unbiased, where vertical lines are 90%
confidence bars; these bars narrow as the number of sets increases.

Figure 4.24 The CEG curve confidence intervals for a single run and coverage accuracy
over many runs. The left plot shows 90% confidence intervals developed
for 30 target samples and 30 non-target samples (the underlying densities
are the same as in the right plots of Figures 4.11 and 4.21). The right plot
shows the percent coverage of confidence intervals produced for 247 runs,
where each run repeats the process used to generate the left plot.

5. Quantitative Comparisons


In  this  chapter  quantitative  comparisons  are  made  with  methods  described  in  the 
literature  review  of  Chapter  2.  First,  the  Metz  method,  which  was  discussed  extensively 
in  the  first  part  of  the  literature  review  section  of  Chapter  2,  is  now  reviewed  and 
compared  qualitatively  and  quantitatively  with  the  method  developed  here.  Then  other 
methods are also reviewed and compared. Here, coverage accuracy and alpha (as
described in Chapter 4) are used to compare the accuracy of the confidence intervals of
the method developed here with those of other methods available in the literature. These metrics
provide  tools  for  comparing  the  accuracy  of  the  developed  confidence  intervals  among 
various  ROC  uncertainty  estimation  methods. 
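
As a reminder of how these metrics are tallied, the sketch below treats the observed alpha at one false alarm probability as the fraction of runs whose confidence interval fails to enclose the true value, so that coverage is one minus alpha. This reading is an assumption consistent with the ideal alpha of 0.05 for 95% confidence intervals; the function names and toy intervals are hypothetical.

```python
def observed_alpha(intervals, truth):
    """Fraction of runs whose (lower, upper) interval fails to enclose truth."""
    misses = sum(1 for lo, hi in intervals if not lo <= truth <= hi)
    return misses / len(intervals)


def mean_alpha(intervals_by_pfa, truth_by_pfa):
    """Average the observed alpha over a grid of false alarm probabilities."""
    alphas = [observed_alpha(iv, t) for iv, t in zip(intervals_by_pfa, truth_by_pfa)]
    return sum(alphas) / len(alphas)


# Toy intervals for the correct detection probability from four runs
# at a single false alarm probability (made-up numbers).
toy_intervals = [(0.1, 0.5), (0.2, 0.6), (0.4, 0.9), (0.55, 0.8)]
alpha_at_pfa = observed_alpha(toy_intervals, truth=0.5)
coverage_at_pfa = 1.0 - alpha_at_pfa
```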

5.1  Comparison  with  Metz  confidence  interval  method 

Figure  5.1  compares  the  Metz  method  [Metz  et  al.,  1998]  with  the  method  developed 
here.  This  evaluation  uses  the  software  package  ROCKIT  to  execute  the  Metz  method. 
Beta  densities  generate  30  target  and  30  non-target  samples.  Many  runs  repeat  this 
sample  generation  process,  where  each  run  selects  a  new  set  of  30  target  and  30 
non-target  samples.  Application  of  the  confidence  interval  calculation  method  developed 
here  (see  Section  4.1.4)  generates  unique  ROC  curve  confidence  intervals  for  each  run. 
Confidence  band  coverage  area  evaluation  and  alpha  (coverage  accuracy)  evaluation 
reveal  clear  advantages  of  the  method  developed  here  over  the  Metz  method.  For  many 
runs  of  30  target  and  30  non-target  samples,  the  coverage  accuracy  may  be  evaluated  and 
averaged  over  all  false  alarm  probabilities.  For  120  such  runs,  the  method  developed 
here  is  51%  closer  to  the  ideal  alpha  of  0.05  (for  95%  confidence  intervals)  over  the  range 
of  the  ROC  curve.  Recall  that  larger  confidence  band  coverage  area  without  improved 
coverage  accuracy  implies  less  useful  results.  Again  analyzing  the  120  repeated  runs  of 
30  target  and  30  non-target  samples,  the  Metz  method  has  16%  larger  confidence  band 

Figure 5.1 Alpha and confidence interval lengths for the Metz [Metz et al., 1998]
method and the method developed here. Both methods develop 95%
confidence intervals. Beta densities with target mean of 0.805, target standard
deviation of 0.059, non-target mean of 0.715, and non-target standard
deviation of 0.805 generate 30 target samples and 30 non-target samples many
times. Note that the Metz method appears to be slightly closer to the ideal
alpha than the method developed here between false alarm probability values
of 0 and 0.02 and 0.11 and 0.13, which is not necessarily advantageous
because the method developed here has greater coverage (approximately 97%)
combined with significantly shorter interval lengths (21% shorter at a false
alarm probability of 0.01, for example) at these values. A similar argument
applies for false alarm probability values between 0.25 and 0.4, as the
confidence interval lengths of the two methods are nearly identical, and the Metz
method has wider coverage. For the smallest possible confidence interval
widths that maintain at least (1-alpha) coverage, the method developed here
outperforms the Metz method for every false alarm probability.

area  than  the  approach  developed  here,  and  Metz  has  larger  coverage  for  the  full  range  of 
critical  false  alarm  probability  values  between  0  and  0.2.  For  the  smallest  possible 
confidence interval widths with at least (1-alpha) coverage, the method developed here
outperforms  Metz  at  every  false  alarm  probability.  Note  that  in  contrast  to  the  Metz 
method,  the  method  developed  here  requires  no  assumptions  about  the  shape  of  the  ROC 
curve,  which  is  important  because  for  target  detection  system  evaluation  it  is  not 
appropriate  to  presuppose  the  shape.  Comparing  the  top  and  bottom  plots  of  Figure  5.1, 
note  that  there  is  a  false  alarm  probability  (near  0.2),  where  the  Metz  method  has  a  higher 
alpha  than  the  method  developed  here,  but  also  has  a  larger  confidence  interval  length 
than  the  method  developed  here.  These  results  are  reasonable  because  confidence 
interval  length  does  not  indicate  whether  or  not  the  length  is  over  the  appropriate  range  of 
correct  detection  probabilities. 

The  Metz  method  does  not  allow  for  ready  incorporation  of  prior  assumptions  to  refine 
the  ROC  curve  uncertainty  estimates.  The  choice  of  a  generally  convex  ROC  curve  (if 
only  unintentionally)  becomes  a  choice  of  a  prior.  Some  adjustment  or  weighting  of  the 
covariance  terms  of  the  binormal  approach  could  change  the  standard  error,  but  Metz 
does  not  discuss  such  adjustment.  The  method  developed  here  permits  the  ready 
incorporation  of  target  and  non-target  parameter  priors,  and  it  may  be  easily  extended  to 
any  density  form. 

Figure  5.2  shows  the  ROC  curve  and  example  associated  confidence  bands  for  the 
example  of  Figure  5.1.  Figure  1.3  has  already  revealed  that  the  confidence  intervals  for 
the  Metz  approach  can  result  in  a  significantly  larger  confidence  band  area  than  the 
confidence  band  area  for  the  method  developed  here. 

Comparison  with  the  Metz  method  makes  clear  significant  weaknesses  in  the  ability  of 
the  Metz  method  to  adapt  to  curve  forms  that  are  not  concave.  This  comparison  shows 
that  the  Metz  method  is  inferior  in  confidence  interval  coverage  accuracy  and  confidence 
band  area  compared  with  the  method  developed  here.  However,  even  disregarding  these 

Figure 5.2 Comparison of ROC curve and confidence intervals. Here 30 target and 30
non-target samples are drawn from beta densities for which the solid curve is
the true ROC curve for an infinite set of samples (the target mean is 0.715, the
target standard deviation is 0.01, the non-target mean is 0.715, and the
non-target standard deviation is 0.046). The 90% confidence interval contours
for the method developed here and the Metz method are shown. Figure 5.1
reports the coverage accuracy and confidence interval widths for many runs,
and the plot shown here gives one example of such a run.

disadvantages, the Metz approach does not apply to the confidence error generation
(CEG)  curve  or  other  performance  metrics  where  the  assumed  form  of  the  performance 
metric  curve  is  not  a  straight  line  in  normal  deviate  space. 
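
The phrase "a straight line in normal deviate space" can be made concrete with the standard binormal ROC form, in which the correct detection probability is Phi(a + b * Phi^-1(P_FA)) and Phi is the standard normal distribution function. The sketch below uses this textbook parameterization; the symbols a and b and their values are assumptions for illustration and are not taken from the Metz software.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit standard deviation


def binormal_roc(pfa, a, b):
    """Correct detection probability of a binormal ROC curve for 0 < pfa < 1."""
    return STD_NORMAL.cdf(a + b * STD_NORMAL.inv_cdf(pfa))


a, b = 1.2, 0.8  # illustrative intercept and slope
pfa_values = [0.05, 0.2, 0.5, 0.8]

# Transform both axes to normal deviates: the points fall on the line a + b * z.
z_pfa = [STD_NORMAL.inv_cdf(p) for p in pfa_values]
z_pd = [STD_NORMAL.inv_cdf(binormal_roc(p, a, b)) for p in pfa_values]
```

A performance metric curve that does not become linear under this transformation falls outside the binormal form, which is the limitation noted above for the CEG curve.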

5.2 Comparison with Zhou confidence interval method

The  literature  considers  various  ROC-curve  bootstrap  approaches,  i.e.,  methods  that 
generate  confidence  bounds  using  subsets  of  the  available  target  and  non-target  samples. 
This  section  examines  the  most  recent  approach,  Zhou  [Zhou  and  Qin,  2005],  who 
obtains  results  that  improve  upon  the  bootstrap  results  of  Platt  [Platt  et  al.,  2000].  A 
general  advantage  for  bootstrap  methods  is  that  they  make  no  assumptions  about  the  form 
of  the  densities  (such  as  assuming  a  beta  density).  Both  Platt  and  Zhou  claim  reasonable 
coverage  accuracies  for  95%  confidence  intervals  of  correct  detection  probability  at  false 
alarm  probabilities  of  0.1  and  0.2;  Zhou  claims  smaller  confidence  interval  widths. 

Zhou  develops  two  new  bootstrap-based  approaches;  the  approach  that  Zhou  regards  as 
optimal  is  used  here  for  comparison.  In  discussing  Platt’s  work,  Zhou  points  out 
disadvantages  of  bootstrap  methods,  such  as  the  high  number  of  target  and  non-target 
samples  necessary  for  accurate  results.  Zhou  claims  that  a  binomial  correction  factor 
improves  bootstrap-based  results,  particularly  at  low  numbers  of  samples.  He  considers 
multiple  examples  with  20  target  samples  and  20  non-target  samples,  whereas  Platt’s 
research  focuses  on  100  target  samples  and  100  non-target  samples. 

Zhou’s  paper  only  considers  results  at  false  alarm  probabilities  of  0.1  and  0.2.  Figure 
5.3,  which  corresponds  with  Zhou’s  example  2  and  3,  uses  Zhou’s  method  but  develops 
confidence  intervals  for  other  false  alarm  probabilities.  At  false  alarm  probabilities  of 
0.1  and  0.2,  coverage  accuracies  similar  to  Zhou’s  results  are  obtained  (see  the  top  right 
plot  of  Figure  5.3).  The  confidence  interval  widths  are  also  consistent  with  Zhou’s 
findings. As Zhou and Platt both focus only on false alarm probabilities of 0.1 and 0.2, a
key  concern  is  whether  or  not  confidence  intervals  are  accurate  over  other  false  alarm 

Figure 5.3 Confidence intervals for one run of the Zhou [Zhou and Qin, 2005] method,
coverage accuracy for many runs, and comparisons with the method
developed here. Zhou examines false alarm probabilities of 0.1 and 0.2, and his
work is extended here to the full range of false alarm probabilities. The
top left plot shows a representative ROC curve with confidence intervals for
the Zhou approach with 20 target and 20 non-target samples. The top right
plot shows the percent coverage of the Zhou method for 1700 runs with 90%
coverage vertical confidence bars. The lower two plots compare 116 runs
for the method developed here. Note that in contrast to Zhou, the method
developed here produces smooth confidence bands and ROC curves, and the
coverage is consistent over the full range of false alarm probabilities.

probabilities. Examining the top right plot of Figure 5.3, the ROC curve for a density
pair  that  Zhou  selects  deviates  considerably  from  the  ideal  95%  coverage  between  a  false 
alarm  probability  of  0.3  and  0.88.  Recall  that  bootstrap  methods  rely  only  on  the 
observed  samples  (rather  than  estimates  of  densities).  If  the  underlying  density  that 
generates  the  samples  is  relatively  small  at  a  particular  score,  the  corresponding  correct 
detection  probabilities  at  that  score  are  difficult  to  estimate  with  a  bootstrap  method. 
Figures  5.4,  5.5,  and  5.6  show  the  underlying  densities  that  Zhou  uses  as  examples, 
results  of  the  Zhou  method,  and  a  comparison  with  the  method  developed  here.  In 
contrast  to  Zhou  and  other  bootstrap  methods,  the  method  developed  here  has  appropriate 
coverage  accuracy  over  the  entire  range  of  the  ROC  curve. 
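
To illustrate why bootstrap bounds depend only on the observed samples, the following sketch computes a plain percentile bootstrap interval for the correct detection probability at one false alarm probability. It is a generic bootstrap and not the Zhou and Qin method, whose corrections are not reproduced here; the function names, the resampling count, and the placeholder samples are assumptions.

```python
import random


def detection_prob(targets, nontargets, pfa):
    """Empirical correct detection probability at false alarm probability pfa."""
    sorted_non = sorted(nontargets, reverse=True)
    k = int(pfa * len(sorted_non) + 1e-9)
    threshold = sorted_non[k] if k < len(sorted_non) else float("-inf")
    return sum(t > threshold for t in targets) / len(targets)


def bootstrap_interval(targets, nontargets, pfa, level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap interval from resampling with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        t = rng.choices(targets, k=len(targets))
        n = rng.choices(nontargets, k=len(nontargets))
        estimates.append(detection_prob(t, n, pfa))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_boot - 1))]
    upper = estimates[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper


rng = random.Random(3)
targets = [rng.betavariate(6, 3) for _ in range(20)]     # placeholder samples
nontargets = [rng.betavariate(3, 6) for _ in range(20)]  # placeholder samples
lower, upper = bootstrap_interval(targets, nontargets, pfa=0.1)
```

With 20 target samples every bootstrap estimate is a multiple of 1/20, which is one reason bands built this way are stepped rather than smooth.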


5.3  Comparison  with  Hall  confidence  interval  method 

Hall  [Hall  et  al.,  2004]  uses  a  kernel-based  approach  to  form  confidence  intervals.  They 
use  an  updated  bandwidth  calculation  approach  that  extends  previous  kernel-based 
approaches.  The  method  they  develop  requires  use  of  10  different  smoothing  parameters 
to  set  different  bandwidths.  They  report  coverage  accuracy  results  where  samples  are 
generated  repeatedly  from  assumed  underlying  densities  and  report  results  for  100  target 
samples  and  100  non-target  samples.  These  results  appear  to  be  generally  accurate, 
except  at  the  extremes  of  false  alarm  probability,  where  the  coverage  accuracy  often 
declines.  This  result  is  of  concern,  as  very  low  false  alarm  probabilities  are  often  of 
particular  interest;  however,  adequate  coverage  accuracy  over  the  full  range  of  false 
alarm  probabilities  is  important  as  indicated,  for  example,  in  the  SUT  A  and  SUT  B 
example  of  Chapter  1.  (Of  course,  if  it  is  known  a  priori  that  the  only  false  alarm 
probability  of  interest  is  a  false  alarm  probability  of  0.5,  then  the  Hall  method  performs 
well  for  the  examples  reported  by  Hall.)  Figures  5.7  and  5.8  show  a  comparison  of  the 
method developed here and two of the Hall examples. The weaknesses that the Hall
method can have at the extremes of false alarm probabilities are apparent in the figures.

Figure 5.4 Underlying densities for examples used to compare with the Zhou
[Zhou and Qin, 2005] method. Zhou selects the above beta densities, and
these densities generate target and non-target samples. The solid lines are
target and the dotted lines are non-target. Note that examples 2 and 3 are
combined because Zhou uses the same underlying densities for two examples
(they examine false alarm probabilities of 0.1 and 0.2).

Figure 5.5 Coverage accuracy for Zhou [Zhou and Qin, 2005] confidence bounds. The
plots show the percent coverage of confidence bounds for each of the four
density pairs of Figure 5.4. Note that Zhou only examines false alarm
probabilities of 0.1 and/or 0.2, so examples 2 and 3 have identical underlying
densities. These plots are similar to the top right plot of Figure 5.3, except
that three additional examples are shown, where 1700 sets of 20 target
samples and 20 non-target samples are the inputs.

Figure 5.6 Percent coverage of comparison bounds for the method developed here. The
plots show the percent coverage of confidence bounds for each of the four
density pairs of Figure 5.4 using the method developed here based upon sets
of 20 target and 20 non-target samples. Zhou considers only false alarm
probabilities of 0.1 and/or 0.2. These plots are similar to Figure 5.3, except
that three additional examples are shown.

Figure 5.7 uses normal target and non-target densities, and Figure 5.8 shows beta target
and  non-target  densities.  For  the  method  developed  here,  the  normal  target  and 
non-target  densities  first  generate  samples,  then  these  samples  are  transformed  so  that  the 
greatest  value  among  the  target  and  non-target  samples  is  0.95  and  the  lowest  value 
among  the  target  and  non-target  samples  is  0.05.  In  addition  to  comparing  favorably 
with  the  Hall  approach,  the  example  of  Figure  5.7  indicates  that  the  method  developed 
here  is  flexible  to  changes  in  assumed  densities. 
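
The rescaling step described above can be sketched as follows. This is a minimal illustration only: it assumes the transformation is linear (the text does not state its form), and the function name rescale_jointly is hypothetical rather than taken from the code of Appendix C.

```python
import random

def rescale_jointly(target, non_target, low=0.05, high=0.95):
    """Linearly map both samples so that the joint minimum becomes `low`
    and the joint maximum becomes `high` (assumed linear transformation)."""
    combined = list(target) + list(non_target)
    smallest, largest = min(combined), max(combined)
    if largest == smallest:
        raise ValueError("all scores are identical; cannot rescale")
    scale = (high - low) / (largest - smallest)
    def to_unit(x):
        return low + (x - smallest) * scale
    return [to_unit(x) for x in target], [to_unit(x) for x in non_target]

# Normal samples as in Figure 5.7: target mean one, non-target mean zero, unit variance.
rng = random.Random(0)
target = [rng.gauss(1.0, 1.0) for _ in range(100)]
non_target = [rng.gauss(0.0, 1.0) for _ in range(100)]
t_scaled, n_scaled = rescale_jointly(target, non_target)
print(min(t_scaled + n_scaled), max(t_scaled + n_scaled))
```

Because the same map is applied to both samples, the ordering of scores (and therefore the empirical ROC curve) is unchanged; only the support is moved into the open unit interval so that a beta density model can be applied.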

5.4  Comparison  with  Hilgers  confidence  interval  method 

Figure  5.9  shows  confidence  intervals  based  on  the  Hilgers  [Hilgers,  1991]  binomial 
method.  The  Hilgers  method  is  similar  to  the  current  AFRL  ROC  curve  confidence 
interval  estimation  approach.  The  coverages  (95%  is  the  objective  in  the  above  case) 
tend  to  be  too  conservative,  and  the  resulting  confidence  intervals  are  too  wide  (see 
discussion  in  [Schafer,  1994]).  The  method  developed  here  provides  a  smoother  estimate 
of  the  ROC  curve  (dash/dotted  line)  than  the  Hilgers  method,  and  more  significantly  it 
produces  much  narrower  confidence  intervals,  particularly  for  low  numbers  of  samples. 

The Hilgers approach uses a binomial-based order statistics approach and finds 95%
error  bars  in  correct  detection  probability  and  false  alarm  probability  at  a  selected 
threshold.  The  resulting  rectangular  region  then  combines  two  error  bars  using  the 
following  procedure.  First,  it  finds  a  best-case  upper  confidence  band  point  for  this 
threshold  as  the  minimum  false  alarm  probability  and  maximum  correct  detection 
probability  within  the  rectangular  region.  Second,  it  finds  a  worst-case  lower  confidence 
band  point  for  this  threshold  as  the  maximum  false  alarm  probability  and  minimum 
correct  detection  probability  within  the  region.  Finally,  it  repeats  for  all  thresholds  and 
combines  results  to  obtain  a  lower  confidence  interval  contour  and  an  upper  confidence 
interval  contour.  This  process  generates  a  95%  ROC  curve  confidence  band.  Although 
bands obtained by this process enclose at least 95% of the true ROC curve, the bands are
conservative in that they are typically larger than necessary. Note that such a band is less
informative than a band with smaller confidence band area provided that both bands have
at least the stated coverage (95% in this case). The top plot of Figure 5.9 shows Hilgers’
results for 30 target samples and 30 non-target samples obtained using Medcalc statistical
software (commercially available software that implements Hilgers’ approach in a 2005
update). The bottom plot shows a much narrower confidence band for the same samples
obtained using the method developed here. In addition to the larger band width, the
Hilgers approach also has a general disadvantage in that the rectangular region
connection that forms the confidence band is generated by an ad-hoc method.

Figure 5.7 The ROC curve confidence interval coverage accuracies for the Hall
[Hall et al., 2004] method and the method developed here for normal target
and non-target densities. Normal target and non-target densities generate
100 target samples and 100 non-target samples. This process is repeated
many times to determine coverage accuracy. The target density has a mean
of one, the non-target density has a mean of zero, and both densities have unit
variance. The plot at left shows Hall’s coverage accuracy at selected false
alarm probabilities for 1000 sets of samples. The plot at right shows a similar
graph for the method developed here, with 90% vertical confidence bars for
208 sets of samples (90% vertical bars show uncertainty due to the lower
number of runs). Hall’s coverage is generally accurate, except as
false alarm probability approaches zero or one. This inaccuracy is a weakness
in the Hall approach, because often the most significant false alarm
probabilities are near zero.

Figure 5.8 The ROC curve confidence interval coverage accuracies for the Hall
[Hall et al., 2004] method and the method developed here for beta target and
non-target densities. Beta target and non-target densities generate 100 target
samples and 100 non-target samples. This process is repeated many times
to determine coverage accuracy. For the target density the beta parameters
are a = 2 and b = 4, and for the non-target density they are a = 2 and b = 3.
These figures otherwise use the same process as Figure 5.7. The left plot
shows Hall’s results and the right plot shows results of the method developed
here.

Figure 5.9 Comparison with the Hilgers [Hilgers, 1991] binomial method. The method
uses techniques similar to the current AFRL approach for generating ROC
curve confidence interval estimates. The top plot shows the 95% confidence
intervals for the Hilgers method. These intervals cover the stated confidence
interval region, but the confidence intervals are too wide [Schafer, 1994].
The bottom plot shows (also for 95% confidence intervals) that the approach
developed here provides a smoother estimate of the ROC curve (dash/dotted
line), and, more significantly, it produces much narrower confidence intervals.
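
The threshold-by-threshold construction described in this section can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Hilgers' procedure or of the Medcalc implementation: the binomial error bars are taken here to be Clopper-Pearson exact intervals (found by bisection), a detection is declared when a score is at or above the threshold, and the names clopper_pearson and rectangle_corner_band are hypothetical.

```python
import random
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, level=0.95):
    """Exact binomial interval for a proportion with k successes in n trials."""
    alpha = 1 - level
    def solve(f, value):
        # f is decreasing in p; bisect for f(p) == value on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > value:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def rectangle_corner_band(target, non_target, level=0.95):
    """For each threshold, keep the best-case and worst-case rectangle corners."""
    upper, lower = [], []
    for t in sorted(set(target) | set(non_target)):
        k_detect = sum(x >= t for x in target)
        k_alarm = sum(x >= t for x in non_target)
        pd_lo, pd_hi = clopper_pearson(k_detect, len(target), level)
        pf_lo, pf_hi = clopper_pearson(k_alarm, len(non_target), level)
        upper.append((pf_lo, pd_hi))  # minimum false alarm, maximum detection
        lower.append((pf_hi, pd_lo))  # maximum false alarm, minimum detection
    return sorted(upper), sorted(lower)

rng = random.Random(0)
target = [rng.betavariate(4, 2) for _ in range(30)]
non_target = [rng.betavariate(2, 4) for _ in range(30)]
upper, lower = rectangle_corner_band(target, non_target)
```

Each element of upper and lower is a (false alarm probability, correct detection probability) pair. Joining the corners of two separate 95% error bars in this way is what makes the resulting band wide, which is consistent with the conservatism noted above.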

The  results  demonstrate  the  robustness  of  the  method  developed  here  when  the  overall 
model  density  form  assumptions  are  correct.  The  method  developed  here  is  expected  to 
improve  ROC  confidence  interval  results  compared  with  other  approaches  in  most  cases. 
The  method  developed  here  provides  a  flexible  and  robust  framework  by  which  target  and 
non-target  samples,  model  assumptions,  and  prior  densities  can  be  incorporated. 

5.5  Additional  considerations 

In  determining  which  ROC  confidence  interval  approach(es)  are  appropriate,  sample  size 
and  knowledge  of  the  density  model  form  are  important  factors  to  consider.  The 
following  provides  a  few  scenarios. 

Large  numbers  of  samples  are  available  and  there  is  no  prior  knowledge  of  target  and 
non-target  density  form.  Bootstrap  methods  may  be  acceptable.  For  example,  the 
bootstrap  method  of  Zhou  [Zhou  and  Qin,  2005]  may  be  acceptable,  if  a  large  number 
(more than 100) of target and non-target scores are available, and if the density forms of
the target and non-target scores are not known, but are thought to be non-normal and non-beta.
Figure 5.10 is similar to the Zhou method (Example 2/3) of Figure 5.5, except that rather
than 20 target and 20 non-target samples, various numbers of samples are shown. Note
that while the coverage accuracy improves for increased number of samples, a large
number of samples may be required to achieve good coverage accuracy. Coverage
accuracy depends on the target and non-target density being evaluated. An additional
example is shown in Figure 5.11 where the Zhou method forms confidence intervals
based on the samples generated from underlying normal densities (the same underlying
densities previously used in Figure 5.7). For this example, the Zhou confidence bounds
begin to provide appropriate coverage over most false alarm probabilities for somewhat
lower numbers of samples. Thus, a paradox is introduced: the Zhou approach can
provide appropriate coverage for "enough" samples, but in order to know how many
samples are "enough" some knowledge of the underlying densities is needed.

Figure 5.10 Coverage accuracy for Zhou confidence bounds for various numbers of
target and non-target samples for a beta density model. The plots shown are
the same as the bottom left plot of Figure 5.5, except that instead of 20
target and 20 non-target samples, the number of samples is increased to 40
target and 40 non-target samples, 80 target and 80 non-target samples, etc.
Note that while the coverage accuracy does improve for increased number
of samples, a large number of samples can be required to achieve good
coverage accuracy.

Figure 5.11 Coverage accuracy for Zhou confidence bounds for a normal density model.
The plots shown use samples generated from the same underlying densities
as Figure 5.7. Here runs of 50 target and 50 non-target samples and 100
target samples and 100 non-target samples are evaluated. The Zhou bootstrap
method is used to obtain the displayed confidence intervals.

Low  numbers  of  samples  are  available,  there  is  no  prior  knowledge  of  target  and 
non-target density form, and highly conservative confidence bands are acceptable. Here
the  Hilgers  [Hilgers,  1991]  method  is  an  appropriate  choice. 

Low  numbers  of  samples  are  available,  target  and  non-target  densities  are  known  to  be 
normal  or  normal  by  some  transformation,  and  the  probability  of  target  given  score  is 
known to monotonically increase for increased score. The binormal approach, which
makes such assumptions (see Section 2.7.1), may be appropriate in this case.

The  objective  of  the  comparison  detailed  in  the  previous  section  is  to  demonstrate  the 
viability  of  the  framework  developed  here,  not  to  prove  that  a  selected  model  that  uses 
this  framework  outperforms  other  approaches  in  every  case  (particularly  when  the 
selected  model  is  not  correct).  Also,  a  key  objective  of  the  research  here  is  to  develop  a 
performance  metric  uncertainty  estimation  approach  that  extends  to  the  CEG  curve. 

The amount of time to execute a run (i.e., to move from a set of target and non-target
samples  to  obtaining  a  confidence  band)  must  also  be  considered.  For  the  method 
developed  here,  two  primary  factors  contribute  to  run  time. 

First,  consider  the  computation  of  target  and  non-target  posterior  parameter  densities, 
which  are  developed  prior  to  any  ROC  curve  formulation.  The  time  to  approximate 
posterior  parameter  densities  depends  on  the  number  of  target  and  non-target  parameter 
points  selected.  Consider  the  parameter  point  selection  process.  For  the  beta  density 
model,  the  process  implemented  here  starts  with  300  target  points  uniformly  selected 
over  mean  and  standard  deviation  and  300  non-target  points  also  uniformly  selected  over 
mean  and  standard  deviation.  Then  the  combined  posterior  weightings  are  found  for  the 
sample  values  (see  Equation  (3.14)).  The  16  grid  point  combinations  that  are  closest  to 
the mean and standard deviation of the samples are kept (4 target points and 4 non-target
points), along with any combinations whose combined posterior weighting is greater than
that of any of these combinations. Then a 10 x 10 grid (100 points) for target means and
standard deviations and a 10 x 10 grid (100 points) for non-target means and standard
deviations are formed over this region, with much smaller grid point spacing. Again the
combined  posterior  parameter  weightings  are  found  for  each  of  the  10,000  grid  point 
combinations,  and  only  those  points  that  contribute  to  99.9%  of  the  total  posterior 
parameter  weighting  among  these  combinations  are  retained.  The  retained  posterior 
parameter  weightings  then  comprise  an  even  smaller  region  than  the  previous  iteration. 
A second 10 x 10 grid (100 points) for target means and standard deviations and a 10 x
10 grid (100 points) for non-target means and standard deviations are then formed. Again,
the grid points that contribute to 99.9% of the total posterior parameter weighting among
the 10,000 combinations of grid points are retained. The above operations for an
example set of 20 target and 20 non-target samples take approximately 70 seconds using
the Matlab code developed here.
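
The grid weighting and 99.9% retention steps can be sketched for a single class of samples as follows. This is a simplified illustration, not the Matlab code of Appendix C: it assumes a uniform prior over the (mean, standard deviation) grid so that weights are proportional to the beta likelihood, it handles one class rather than target and non-target combinations, and it does not reproduce Equation (3.14). All function names are hypothetical.

```python
import math
import random

def beta_params(mean, sd):
    """Convert (mean, sd) to beta shape parameters; None if no beta density has them."""
    var = sd * sd
    if not 0 < mean < 1 or var <= 0 or var >= mean * (1 - mean):
        return None
    nu = mean * (1 - mean) / var - 1
    return mean * nu, (1 - mean) * nu

def log_likelihood(samples, a, b):
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
               for x in samples)

def grid_weights(samples, means, sds):
    """Normalized posterior weight at each valid grid point (uniform prior assumed)."""
    logs = {}
    for m in means:
        for s in sds:
            shapes = beta_params(m, s)
            if shapes is not None:
                logs[(m, s)] = log_likelihood(samples, *shapes)
    peak = max(logs.values())  # subtract the peak to avoid underflow
    raw = {point: math.exp(v - peak) for point, v in logs.items()}
    total = sum(raw.values())
    return {point: w / total for point, w in raw.items()}

def retain_mass(weights, mass=0.999):
    """Keep the highest-weight points until they hold `mass` of the total weight."""
    kept, running = {}, 0.0
    for point, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        kept[point] = w
        running += w
        if running >= mass:
            break
    return kept

def linspace(lo, hi, n):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

rng = random.Random(1)
samples = [rng.betavariate(2, 4) for _ in range(20)]
coarse = retain_mass(grid_weights(samples, linspace(0.05, 0.95, 19), linspace(0.02, 0.45, 19)))
# Refine: a 10 x 10 grid over the bounding box of the retained coarse points.
m_lo, m_hi = min(m for m, _ in coarse), max(m for m, _ in coarse)
s_lo, s_hi = min(s for _, s in coarse), max(s for _, s in coarse)
fine = retain_mass(grid_weights(samples, linspace(m_lo, m_hi, 10), linspace(s_lo, s_hi, 10)))
print(len(coarse), len(fine))
```

With independent priors, the combined weighting for a target and non-target grid point combination would be the product of the two single-class weights, which is why the number of combinations grows as the product of the two grid sizes.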

The second factor is ROC curve computation time. Each of the retained target and
non-target grid point combinations forms a ROC curve, and these ROC curves must be
computed (see Figure 3.8). Computation of each ROC curve takes approximately 0.75
seconds;  the  total  run  time  for  this  section  depends  on  the  number  of  grid  point 
combinations  that  make  up  the  99.9%  of  the  final  set  of  grid  points  (which  can  range 
from  approximately  200  to  10000).  Total  run  time  for  20  target  samples  and  20 
non-target  samples  generated  from  the  densities  of  Zhou  example  2/3  (see  Figure  5.4)  for 
a  single  example  run  is  244  seconds  (assuming  the  beta  density  model  process  described 
here).  Total  run  time  for  50  target  samples  and  50  non-target  samples  for  this  same  type 
of  run  is  251  seconds.  In  comparison,  a  method  that  implements  Zhou’s  process 
(adjusted  bootstrap  with  250  bootstrap  replications)  in  Matlab  takes  15.5  seconds  for  the 
same  20  target  and  20  non-target  samples  and  33.5  seconds  for  the  same  50  target 
samples  and  50  non-target  samples.  Samples  generated  from  other  density  pairs  can  take 
significantly  longer  per  run  for  the  method  developed  here.  A  similar  process,  again  for 
Zhou’s  example  2/3  for  the  CEG  curve,  takes  170  seconds  for  20  target  and  20  non-target 
samples and 154 seconds for 50 target and 50 non-target samples. An increase in target and
non-target  samples  can  result  in  fewer  grid  point  combinations  in  the  final  99.9%,  so  run 
time  may  decrease  with  increase  in  samples.  Also,  the  computation  of  a  particular  CEG 
curve  (required  for  each  grid  point  retained  in  the  final  set)  is  faster  than  the  computation 
of  a  ROC  curve,  so  the  process  is  faster  for  CEG  curve  confidence  intervals  than  ROC 
curve  confidence  intervals.  An  increase  in  number  of  target  samples  leads  to  a  more 
highly peaked posterior probability density weighting, so the number of grid points used
may need adjustment as sample size changes. Figure 5.12 shows coverage examples;
note that results converge as the number of grid points increases (that is, as grid point
spacing decreases).


More  complex  density  models  (such  as  beta  mixture  models)  can  require  significantly 
more  grid  points  to  cover  the  entire  relevant  parameter  space  (note  also  that  the  regions  of 
high  density  weighting  may  be  disjoint).  Also,  as  the  number  of  grid  points  becomes 
large, computation time increases in proportion to the square of the number of grid points; a
small  increase  in  number  of  grid  points  results  in  a  large  increase  in  run  time. 
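
The scaling described above can be made concrete with a short arithmetic sketch. It uses only the figure of approximately 0.75 seconds per ROC curve quoted earlier; the function names are hypothetical, and actual run times depend on the samples and on how many combinations survive the 99.9% retention step.

```python
def roc_stage_seconds(n_combinations, seconds_per_curve=0.75):
    """Approximate time to compute one ROC curve per retained combination."""
    return n_combinations * seconds_per_curve

def full_grid_combinations(n_target_points, n_non_target_points):
    """Every target grid point is paired with every non-target grid point."""
    return n_target_points * n_non_target_points

# Range of retained combinations quoted in the text: roughly 200 to 10000.
for n in (200, 10000):
    print(n, "combinations:", roc_stage_seconds(n), "seconds")

# A 10% increase in grid points per class gives a 21% increase in combinations.
base = full_grid_combinations(100, 100)
larger = full_grid_combinations(110, 110)
print(base, larger, larger / base)
```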

Most  of  the  computational  challenges  in  terms  of  run  time  are  apparent  when  attempts 
are  made  to  verify  results  by  determining  coverage  accuracy  (e.g.,  the  confidence  band 
development  process  is  repeated  many  times,  such  as  100  or  more  sets  of  30  target  and  30 
non-target  samples  generated  from  the  same  underlying  target  and  non-target  densities). 
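
A coverage accuracy check of this kind can be sketched as follows. The interval method used here is a plain percentile bootstrap at a single false alarm probability, chosen only because it is short; it is neither the Zhou adjusted bootstrap nor the Bayesian method developed here. The underlying densities are the normal densities of Figure 5.7, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

def empirical_tpr(target, non_target, fpr):
    """Fraction of targets above the threshold that admits about fpr * n non-targets."""
    ranked = sorted(non_target, reverse=True)
    k = int(round(fpr * len(ranked)))  # non-targets allowed above the threshold
    threshold = ranked[k] if k < len(ranked) else float("-inf")
    return sum(x > threshold for x in target) / len(target)

def percentile_bootstrap_interval(target, non_target, fpr, rng, n_boot=200, level=0.95):
    stats = []
    for _ in range(n_boot):
        t = rng.choices(target, k=len(target))
        n = rng.choices(non_target, k=len(non_target))
        stats.append(empirical_tpr(t, n, fpr))
    stats.sort()
    tail = (1 - level) / 2
    return stats[int(tail * (n_boot - 1))], stats[int(round((1 - tail) * (n_boot - 1)))]

def coverage(fpr, n_runs=100, n_samples=50, seed=0):
    """Fraction of runs whose interval contains the true detection probability."""
    rng = random.Random(seed)
    std = NormalDist()
    # True ROC point for target N(1, 1) and non-target N(0, 1).
    true_tpr = 1 - std.cdf(std.inv_cdf(1 - fpr) - 1.0)
    hits = 0
    for _ in range(n_runs):
        target = [rng.gauss(1.0, 1.0) for _ in range(n_samples)]
        non_target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        lo, hi = percentile_bootstrap_interval(target, non_target, fpr, rng)
        hits += lo <= true_tpr <= hi
    return hits / n_runs

print(coverage(0.2))
```

The cost structure is visible in the nested loops: every run repeats the full interval construction, so a method that takes minutes per run takes hours to verify over 100 or more runs.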

Appendix  C  includes  code  to  generate  ROC  curve  and  CEG  curve  confidence  intervals. 
The  appendix  provides  code  for  the  beta  density  model,  along  with  code  for  the  two-beta 
mixture  model.  For  the  two-beta  mixture  models  there  are  significantly  more  parameters 
(five  versus  two  for  the  single  beta  model),  and  the  above  grid  point  iteration  procedure  is 
not  applied.  Instead,  the  process  selects  two-beta  grid  points  at  random,  and  calculates 
the  combined  posterior  weighting  for  such  grid  points.  The  user  specifies  the  number  of 
random  grid  points  for  the  two-beta  mixture  model;  a  typical  number  is  10000.  The 
number  of  random  grid  points  may  be  increased  until  convergence  is  observed.  The 
number  of  points  necessary  for  convergence  depends  on  the  specific  sample  values. 
Matlab  matrix  size  limitations  constrain  the  number  of  grid  points  to  about  20000 
(depending  on  the  specific  sample  values).  Methods  available  to  improve  run  time  are 
noted  in  the  Future  Work  discussion. 
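
The random grid point approach for the two-beta mixture model can be sketched as follows. This is not the Appendix C code. It assumes the five parameters are a mixing weight plus a mean and standard deviation for each component, that points are drawn uniformly over fixed ranges (with invalid mean and standard deviation pairs rejected), and that the prior is uniform over those ranges so that weights are proportional to the likelihood. All names and ranges are hypothetical.

```python
import math
import random

def beta_logpdf(x, a, b):
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def shapes_from_moments(mean, sd):
    nu = mean * (1 - mean) / (sd * sd) - 1
    return (mean * nu, (1 - mean) * nu) if nu > 0 else None

def draw_point(rng):
    """Draw (w, a1, b1, a2, b2); reject draws with no valid beta component."""
    while True:
        w = rng.uniform(0.01, 0.99)
        c1 = shapes_from_moments(rng.uniform(0.02, 0.98), rng.uniform(0.01, 0.45))
        c2 = shapes_from_moments(rng.uniform(0.02, 0.98), rng.uniform(0.01, 0.45))
        if c1 is not None and c2 is not None:
            return (w, *c1, *c2)

def mixture_log_likelihood(samples, point):
    w, a1, b1, a2, b2 = point
    total = 0.0
    for x in samples:
        l1 = math.log(w) + beta_logpdf(x, a1, b1)
        l2 = math.log(1 - w) + beta_logpdf(x, a2, b2)
        top = max(l1, l2)  # log-sum-exp for numerical stability
        total += top + math.log(math.exp(l1 - top) + math.exp(l2 - top))
    return total

def random_point_weights(samples, n_points, rng):
    points = [draw_point(rng) for _ in range(n_points)]
    logs = [mixture_log_likelihood(samples, p) for p in points]
    peak = max(logs)
    raw = [math.exp(v - peak) for v in logs]
    total = sum(raw)
    return points, [r / total for r in raw]

def posterior_mean_of_mixture_mean(points, weights):
    return sum(wt * (w * a1 / (a1 + b1) + (1 - w) * a2 / (a2 + b2))
               for (w, a1, b1, a2, b2), wt in zip(points, weights))

rng = random.Random(2)
samples = [rng.betavariate(2, 6) if rng.random() < 0.5 else rng.betavariate(6, 2)
           for _ in range(20)]
for n_points in (500, 2000):
    points, weights = random_point_weights(samples, n_points, rng)
    print(n_points, posterior_mean_of_mixture_mean(points, weights))
```

Repeating the summary for increasing numbers of points is one simple way to observe the convergence mentioned above; the summary statistic chosen here (the posterior mean of the mixture mean) is only an example.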

Here  the  uncertainty  estimation  methods  developed  in  Chapters  3  and  4  were  compared 
with  the  current  literature.  The  next  chapter  provides  a  summary  of  the  results  of  the 
research  and  also  identifies  areas  of  interest  for  future  work. 

Figure 5.12 Regions that make up selected percentages of the posterior parameter
density. The four plots show the regions that encompass 10%, 30%, 50%,
and 90% of posterior parameter weighting for an example where a set of
samples is generated from a target beta density.

6. Accomplishments, Contributions, and Future Work


Section 6.1 reviews the accomplishments and contributions of this research, and Section
6.2  describes  areas  of  interest  for  future  work. 

Prior  to  listing  the  specific  accomplishments  of  this  research  (in  the  next  section),  the 
results  of  the  research  presented  here  are  placed  in  a  proper  perspective. 

The primary contribution of this work is the framework described most fully in
Chapter  3.  Theorem  3.2,  "ROC  curve  density",  develops  an  analytical  approach  for 
forming  the  posterior  probability  density  of  the  ROC  curve.  This  theorem  enables  an 
exact  description  of  the  ROC  curve  probability  density  for  given  target  and  non-target 
samples,  density  model  assumptions,  and  prior  densities  of  model  parameters.  Theorem 
3.3,  "Numerical  approximation  of  ROC  curve  density",  extends  this  analytical 
description  to  a  form  that  is  computationally  practical.  Also  important  as  a  primary 
accomplishment  is  the  extension  of  the  probability  density  developments  in  Chapter  3  to 
confidence  intervals  (as  described  in  Section  4.1.3). 

The  potential  usefulness  of  the  framework  is  further  emphasized  through  a  verification 
and  evaluation  process  that  includes  comparisons  with  other  methods.  While  the 
comparisons  are  interesting,  it  is  improper  to  place  undue  emphasis  on  the  results  of  the 
verification  and  evaluation  process  (Chapters  4  and  5)  as  primary  contributions  of  this 
research,  even  though  these  results  show  promise.  The  theorems  and  further  descriptions 
of  Chapters  3  and  4  enable  "actual  probability  density  statements"  [see  Carlin,  2000,  pp. 
35-36]  for  a  single  set  (or  run)  of  target  and  non-target  score  samples,  for  given  models, 
and  for  given  prior  assumptions. 

Thus,  there  is  no  need  to  evaluate  results  based  on  the  method  developed  here  over  many 
runs,  although  such  runs  can  indicate  efficacy.  The  exactness  over  one  run  of  the 
Bayesian approach is arguably more important than what occurs "on average" over
many runs. Alternative methods of obtaining confidence intervals can be accurate (on
average)  over  many  runs,  but  make  no  claim  regarding  the  results  of  any  particular  run. 
The  approach  introduced  here  enables  an  actual  probability  statement  to  be  made  from 
only  one  run,  but  it  is  not  possible  to  verify  or  evaluate  correctness  except  over  many 
runs.  Obviously,  it  would  be  desirable  to  have  a  process  that  provides  an  actual 
probability  statement  for  one  run  and  that  also  behaves  appropriately  over  many  runs 
(which  the  method  developed  here  clearly  does).  In  considering  which  approach  is  best, 
note  that  there  will  often  be  only  one  set  of  samples,  so  making  the  most  appropriate 
statement  possible  based  on  only  one  run  is  arguably  more  important  than  what  occurs 
over many runs.

The  density  model  example  assumed  in  this  research  is  beta-based  (predominantly 
focused on a unimodal beta density model). This model serves merely as an example
application  of  Theorems  3.2  and  3.3.  Because  the  scores  that  are  inputs  in  this  research 
are  continuous  between  zero  and  one,  the  beta  density  seems  appropriate  (see 
[Kagan et al., 1973]); however, this research has not shown, and does not intend to show, that the
beta  density  is  effective  and/or  appropriate  when  the  model  density  is  not  known.  In 
particular,  it  is  not  the  objective  of  this  research  to  show  that  the  beta  density  always 
provides  a  good  estimate  for  all  sets  of  data  samples  when  model  form  is  not  known;  the 
beta  density  model  is  simply  an  example.  Thus,  a  caution  on  the  results  in  Chapter  5  is 
that  the  comparisons  with  existing  research  do  not  enable  true  "apples-to-apples" 
comparisons;  the  comparisons  made  in  Chapter  5,  while  appropriate  in  demonstrating  the 
Bayesian  framework,  do  not  show  that  the  method  developed  here  is  necessarily  an 
improvement  over  existing  approaches.  In  simply  demonstrating  the  framework,  the 
method  developed  here  generally  uses  samples  from  beta  densities  where  the  parameters 
are  assumed  to  be  unknown.  (The  method  developed  here  then  uses  the  samples  to 
develop  probability  densities  for  the  unknown  parameters.)  Also,  the  comparisons  are 
generally  made  with  methods  that  make  differing  model  assumptions.  Note  that 
currently  available  methods  in  the  literature  do  not  enable  the  selection  of  a  beta  density 
model. As discussed below and in future work, it would be of interest for future
developments  to  incorporate  the  results  of  Chapters  3  and  4  into  flexible  software  that 
enables  user  selection  of  densities  or  model  assumptions. 

Unless  one  is  guaranteed  that  a  particular  model  assumption  or  prior  is  correct,  a 
reasonable  question  concerns  the  usefulness  of  the  results  of  the  framework  developed 
here. Consider the available alternatives. Bootstrap-based approaches avoid
assumptions, but as is shown in Figure 5.10, unless large numbers of samples are
available, avoiding such assumptions can yield poor results (certainly if large numbers of
samples are available, then bootstrap-based approaches are very much of interest).
Existing research, with the exception of bootstrap-based approaches, makes model
assumptions; the framework presented here also makes model assumptions. The
difference  between  the  method  developed  here  and  other  approaches  is  that  the  other 
approaches  develop  frameworks  that  involve  restrictive  model  assumptions.  The 
framework  developed  here  enables  flexible  model  assumptions.  In  this  regard  (as  future 
work)  the  framework  developed  here  could  be  extended  so  that,  for  example,  the  user 
might  specify  "bi-modal  density  mixture  model",  "tri-modal  beta  density  mixture 
model",  etc. 

Another  question  is  that  if  it  is  not  known  whether  or  not  a  set  of  samples  is  modeled  well 
by  a  beta  density  model,  how  could  the  research  presented  here  possibly  be  of  interest? 
Two  considerations  are  as  follows.  First,  as  future  work,  an  examination  of  the  fit  of  a 
beta  density  model  to  experimental  data  with  fixed  end  points  is  of  interest.  Second,  an 
extension  that  also  may  be  of  interest  for  future  work  is  the  incorporation  of  models  of 
varying  complexity,  which  is  possible  through  regularization  (see  [Bishop,  1995])  and  the 
use of the Occam factor (see [Gregory, 2005]). Such approaches do not select a beta
density model or a bi-modal density model, etc.; instead, they incorporate models of
different complexities. Less complex models, such as single beta densities, receive higher
overall weighting, and more complex models receive less overall weighting (even though
there may be specific instances where the more complex models fit the data better). The use of
roughness  is  an  alternative  approach  that  incorporates  models  of  various  complexity  (see 
discussion  in  Future  Work,  and  related  results  presented  in  Appendix  B). 

6.1  Accomplishments  and  contributions 

This  research  applies  a  new  framework  for  ROC  curve  uncertainty  estimation  that  is  fully 
Bayesian,  that  is  numerically  tractable,  and  that  leads  to  substantial  improvements  over 
existing  methods.  Quantitative  comparisons  are  made;  however,  qualitative 
improvements  are  the  most  important  outcome  of  the  research  presented  here.  As 
discussed  in  Chapters  2  and  5,  most  existing  methods  make  restrictive  assumptions  that 
inhibit  the  application  of  a  flexible  model  framework  as  presented  here;  the  bootstrap 
approaches  do  not  require  such  assumptions  but  are  of  limited  applicability  for  small 
numbers  of  samples. 

A  significant  aspect  of  this  research  is  that  the  uncertainty  estimation  process  developed 
here  transitions  to  CEG  curves.  The  CEG  curve  is  a  critical  metric  for  AFRL  in 
determining  the  usefulness  of  target  detection  systems.  With  a  typically  limited  amount 
of  data  and  with  no  appropriate  methods  for  CEG  curve  uncertainty  estimation,  AFRL 
has  previously  been  able  to  make  only  limited  use  of  this  metric.  With  the  methods 
developed  here,  the  CEG  curve  can  be  applied  and  its  uncertainty  can  be  estimated  even 
for  low  numbers  of  samples. 

The  research  reported  here  demonstrates  the  application  of  ROC  curve  uncertainty 
estimation  methods  from  the  medical  community  to  target  detection.  It  also  provides 
more  comprehensive  qualitative  and  quantitative  comparisons  of  alternative  ROC  curve 
and  AUC  value  uncertainty  estimation  approaches  than  any  available  in  the  literature. 

ROC  curve  density  and  confidence  interval  generation.  This  research  applies  a 
Bayesian  framework  to  develop  new  methods  for  ROC  curve  density  generation  which 
are also applicable to other target detection performance metrics. The framework is
provided within Chapter 3 (which includes four theorems, a lemma and a procedure);
more specifically, Theorem 3.2 provides an analytical approach for forming the
probability density of the ROC curve, and Theorem 3.3 extends this analytical
description into a form that is practical to evaluate numerically. Note that while ROC
curve  definitions  are  examined  in  the  previous  literature  (see  [Lloyd,  2002]  and 
[Zhou  and  Qin,  2005]),  the  probability  density  results  obtained  here  are  unprecedented. 

Computations  of  confidence  bands  or  confidence  intervals  (as  described  in  Section  4.1.3) 
can  be  made  from  the  performance  metric  densities  in  a  straightforward  manner.  This 
capability  contrasts  with  previous  methods  in  the  literature,  which  generally  are 
applicable  only  to  specific  band  or  interval  definitions  and  which  can  not  be  easily 
extended.  Application  of  the  Bayesian  framework  allows  the  user  of  a  SUT  to  better 
understand  conclusions  from  performance  metrics,  especially  if  they  are  based  on  limited 
data. 

This  research  presents  the  results  of  simulations  and  real-data  experiments  that 
demonstrate  the  significance  of  the  new  uncertainty  estimation  methods.  Computational 
techniques  that  implement  the  methods  are  demonstrated,  and  they  are  shown  to  yield 
accurate  results  that  are  otherwise  not  analytically  tractable.  Significantly,  the  methods 
developed  here  enable  the  calculation  of  actual  performance  metric  probability  densities 
for  given  target  and  non-target  score  samples,  given  density  forms  for  the  scores,  and 
given  prior  densities  for  the  parameters  in  these  forms. 

Representative ROC curve generation. This research develops methods that generate
representative  ROC  curves  (samples  from  a  ROC  curve  density)  from  given  sets  of  target 
and  non-target  samples.  Numerical  implementation  of  the  method  for  generating  the 
ROC  (and  CEG)  curve  densities  results  in  the  generation  of  representative  ROC  (and 
CEG) curves. Macskassy [Macskassy and Provost, 2004] [Macskassy et al., 2005] most
recently emphasizes the critical need for such representative ROC curves and the lack of
such  ROC  curves  in  the  literature.  From  such  representative  ROC  curves  (or 
representative  CEG  curves),  many  descriptive  statistics,  such  as  mean  and  median  ROC 
curves  and  AUC  values  and  confidence  bands  and  intervals  for  them,  are  obtained.  The 
results  are  shown  to  be  robust  when  the  overall  model  density  form  assumptions  are 
correct. 
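
As an illustration of how descriptive statistics can be read off a set of representative curves, the following is a minimal sketch in Python. It assumes the curves are already available on a shared false-alarm grid; the stand-in curves y = x^a, the function names, and the pointwise (rather than simultaneous) band definition are assumptions made for illustration and are not the code of Appendix C.

```python
import random

def pointwise_band(curves, coverage=0.95):
    """Return (lower, median, upper) lists, one value per false-alarm grid point.

    `curves` is a list of ROC curves, each a list of detection probabilities
    evaluated on the same false-alarm grid. The band is pointwise: at each
    grid point it takes empirical quantiles across the curves.
    """
    n_points = len(curves[0])
    tail = (1.0 - coverage) / 2.0
    lower, median, upper = [], [], []
    for k in range(n_points):
        column = sorted(curve[k] for curve in curves)
        last = len(column) - 1
        lower.append(column[int(round(tail * last))])
        median.append(column[int(round(0.5 * last))])
        upper.append(column[int(round((1.0 - tail) * last))])
    return lower, median, upper

def auc(curve, grid):
    """Trapezoidal area under one ROC curve."""
    return sum(0.5 * (curve[k] + curve[k + 1]) * (grid[k + 1] - grid[k])
               for k in range(len(grid) - 1))

# Hypothetical stand-in for representative ROC curves: y = x**a, random a.
random.seed(0)
grid = [k / 100 for k in range(101)]
curves = [[x ** random.uniform(0.2, 0.8) for x in grid] for _ in range(500)]

lower, median, upper = pointwise_band(curves, 0.95)
aucs = sorted(auc(c, grid) for c in curves)   # AUC sample, one value per curve
```

The same sorted AUC sample can be used to read off a median AUC value and an AUC interval in the same way as the curve band.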

CEG curve density, representative CEG curve generation, and confidence interval
generation. The methods developed here can be applied to CEG curves. The lack of a
proven means for obtaining confidence intervals for the CEG curve was a primary
motivation for AFRL sponsorship of this research. The research reported here goes
beyond simply adapting an existing ROC curve confidence interval estimation method
and applying it to the CEG curve. Instead, it applies a Bayesian framework to create,
demonstrate,  and  validate  new  methods  that  can  be  applied  beyond  the  uncertainty 
estimation  problem  originally  addressed. 

Target and non-target density flexibility. Although the examples considered here use
beta densities, the methods developed here can be directly applied to other density forms.
In  contrast,  the  binormal  ROC  curve  in  predominant  use  implies  a  nearly  convex  ROC 
curve  form  and  restricts  curve  estimation  to  this  form.  The  methods  developed  here  are 
particularly  important  for  cases  where  sample  size  is  small,  as  is  typical  in  target 
detection  problems.  Thus,  this  research  is  expected  to  alter  the  way  that  the  target 
detection  evaluation  community  approaches  ROC  and  CEG  curve  uncertainty  estimation. 

6.2  Future  work 

The  success  of  this  research  should  motivate  further  investigation  in  several  areas: 
1. Improve the efficiency of target and non-target density posterior parameter
computation.  As  the  number  of  parameter  evaluation  points  increases  (see  Figure  4.9), 
the  ROC  curve  density  converges  (see  Theorem  3.3)  provided  that  the  relative  spacing  of 
the  points  does  not  change  (for  example,  the  spacing  is  kept  uniform  over  mean  and 
standard  deviation).  More  computationally  efficient  methods  to  obtain  sufficient 
numbers  of  evaluation  points  should  be  investigated.  This  optimization  would  assist  in 
the  transfer  of  the  Bayesian  framework  to  more  complex  density  models.  Jordan 
[Jordan  et  al.,  1999]  focuses  on  a  variational  approach  and  references  alternatives  such  as 
the  pruning  algorithm,  bounding  conditioning,  search-based  methods,  and  localized 
partial  evaluation.  Bos  [Bos,  2002]  describes  alternatives  such  as  Gibbs  sampling  and 
importance sampling. Madigan [Madigan and Raftery, 1994], Raftery
[Raftery et al., 2003], and Hoeting [Hoeting et al., 1999] reference Bayesian model
averaging  and  Occam’s  window  for  reducing  the  computational  complexity  of  posterior 
parameter  density  evaluation. 

2.  Develop  integrated  confidence  band  computation  approaches.  As  noted  in  Section 
2.7,  while  the  framework  used  and  the  methods  developed  here  apply  to  many  types  of 
ROC  curve  uncertainty  estimation,  there  are  other  approaches  that  may  be  acceptable  in 
particular  cases.  For  example,  the  binomial  approach  provides  bands  that  encompass 
greater  than  or  equal  to  95%  coverage  for  95%  confidence  bands.  Confidence  bands 
based  on  the  binomial  approach  are  overly  conservative  but  may  be  applied  as  an  upper 
bound  to  ROC  curve  confidence  bands  for  the  method  developed  here.  Thus,  relevant 
aspects  of  each  of  the  approaches  may  be  combined  to  achieve  joint-method  ROC  curve 
confidence  bands. 

3.  Test  the  methods  developed  here  with  other  density  models.  Example  alternative 
density  models  include  hybrid  models  that  combine  Gaussian  densities,  beta  densities,  or 
both.  In  such  a  combination  approach,  density  models  that  have  higher  complexity,  even 
if they fit the data well, may be regarded as less likely to represent the true model (see
[MacKay, 1992b]). Here complexity could refer to the number of parameters in the
model,  e.g.,  a  single-beta  density  has  two  parameters  (mean  and  standard  deviation)  and  a 
two-beta  mixture  density  has  five  parameters  (two  means  and  standard  deviations  plus  an 
amplitude  ratio).  Regularization  techniques  can  combine  models  of  varying  numbers  of 
parameters  (see  Bishop  [Bishop,  1995]).  To  avoid  the  possible  over-fitting  effects  of 
more  complex  densities,  target  and  non-target  score  density  function  roughness  or  ROC 
or  CEG  curve  roughness  could  be  used  to  quantify  complexity.  Appendix  B  addresses 
related  issues  by  first  examining  interpolation  methods  that  have  desirable  extrapolation 
properties  based  on  roughness;  it  then  describes  an  analytical  approach  for  roughness 
computation,  where  the  roughness  of  a  function  is  defined  as  its  integrated  squared 
second  derivative.  Approaches  that  incorporate  roughness  recognize,  for  example,  that  a 
density  function  with  large  roughness  that  describes  the  data  well  may  be  less  desirable 
than  a  density  function  that  describes  the  data  less  well  but  that  has  low  roughness. 

4.  Apply  the  methods  developed  here  to  additional  performance  metrics.  Once  the  ROC 
curve  density  is  developed,  the  research  presented  here  shows  that  transition  to  the  CEG 
curve  density  is  straightforward.  This  transition  could  be  made  to  other  performance 
metrics,  including  the  Dice  similarity  coefficient  (see  Zou  [Zou  et  al.,  2004]),  mutual 
information  (see  Zou  [Zou  et  al.,  2004]),  partial  AUC  (see  [Dodd  and  Pepe,  2003]),  and 
the  Youden  index  (see  Faraggi  [Faraggi,  2003]). 
Appendix A. Analytical Derivations and Numerical Approximations


A.1 Derivation of ROC curve
Theorem Score-threshold ROC curve

Let $f(s;u)$ and $g(s;v)$ be densities of $s$ given $u$ and $v$, where $s$ is a scalar and $u$ and $v$ are
vectors, assume that $f(s;u)$ and $g(s;v)$ are integrable $\forall\, u, v$, and let
$\bar F(t;u) = \int_t^\infty f(s;u)\,ds$ and $\bar G(t;v) = \int_t^\infty g(s;v)\,ds$. Also let $w = [u_1\ u_2\ \cdots\ v_1\ v_2\ \cdots]$ and
$\bar F(t;u) = 1 - F(t;u)$ and $\bar G(t;v) = 1 - G(t;v)$, where $F(t;u)$ and $G(t;v)$ are
cumulative probability distributions. Let $x = \bar F(t;u)$ and $y = \bar G(t;v)$. Assume there is
a unique correspondence of $s$ to $\bar F(s;u)$ such that $0 < \bar F(s;u) < 1$ and $\bar F$ is invertible
(by the Implicit and Inverse function theorems; see [Olmstead, 1961]). Then
$y = r(x;w)$, where $r = \bar G \circ \bar F^{-1}$.

Proof

If $F(s;u) = \int_{-\infty}^{s} f(s';u)\,ds'$ is a cumulative distribution function (CDF), then [Stark and
Woods, 1986, pp. 42]

$$P_u(s_1 < S \le s_2) = F(s_2;u) - F(s_1;u) \ge 0 \quad \text{for } s_1 < s_2. \tag{A.1}$$

If $f(s;u)$ is a probability density function (PDF), then [Stark and Woods, 1986, pp. 44]

$$\int_{s_1}^{s_2} f(s;u)\,ds = P_u(s_1 < S \le s_2). \tag{A.2}$$

By Equations (A.1) and (A.2),

$$\int_{s_1}^{s_2} f(s;u)\,ds = F(s_2;u) - F(s_1;u). \tag{A.3}$$

Also [Stark and Woods, 1986, pp. 41],

$$F(-\infty;u) = 0, \quad F(\infty;u) = 1. \tag{A.4}$$

By Equations (A.3) and (A.4),

$$\int_{s_1}^{\infty} f(s;u)\,ds = F(\infty;u) - F(s_1;u) = 1 - F(s_1;u). \tag{A.5}$$

Since it has been defined that $x = \int_t^\infty f(s;u)\,ds = \bar F(t;u)$, by Equation (A.5)

$$x = \int_t^\infty f(s;u)\,ds = 1 - F(t;u) = \bar F(t;u). \tag{A.6}$$

Using an identical argument,

$$y = \int_t^\infty g(s;v)\,ds = 1 - G(t;v) = \bar G(t;v). \tag{A.7}$$

Further, since $\bar F(s;u) = 1 - F(s;u)$, and since $F(s_1;u) < F(s_2;u)$ for $s_1 < s_2$,

$$\bar F(s_1;u) > \bar F(s_2;u). \tag{A.8}$$

Since $F$ is continuous from the right [Stark and Woods, 1986, pp. 44], i.e.,

$F(s;u) = \lim_{\epsilon \to 0} F(s + \epsilon;u)$, $\epsilon > 0$, and since $\bar F(s;u) = 1 - F(s;u)$,

$$\bar F(-\infty;u) = 1, \quad \bar F(\infty;u) = 0, \tag{A.9}$$

$\bar F$ is continuous from the left, i.e., $\bar F(s;u) = \lim_{\epsilon \to 0} \bar F(s - \epsilon;u)$, $\epsilon > 0$.
Since $\bar F$ is invertible and its inverse is unique for each $u$, then for any $x$, $0 < x < 1$,

$x = \int_t^\infty f(s;u)\,ds$ for some unique threshold $t = \bar F^{-1}(x;u)$, $\bar G \circ \bar F^{-1}(x;w) = \bar G(t;v)$,
$x = \bar F(t;u)$, and it follows that $y = r(x;w)$, where $r = \bar G \circ \bar F^{-1}(x;w)$.
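
The construction in the theorem (invert the first survival function at x to get the threshold, then evaluate the second survival function there) can be sketched numerically. The following Python sketch is illustrative only: the two example densities, Beta(1,2) for f and Beta(2,1) for g, and the bisection routine are assumptions chosen because their survival functions have simple closed forms.

```python
def inverse_survival(sf, x, lo=0.0, hi=1.0, tol=1e-12):
    """Find the threshold t in [lo, hi] with sf(t) = x.

    Assumes sf is continuous and strictly decreasing on [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sf(mid) > x:
            lo = mid      # survival still too large: move the threshold right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def roc(sf_f, sf_g, x):
    """Evaluate y = (G-bar o F-bar^{-1})(x) for survival functions sf_f, sf_g."""
    t = inverse_survival(sf_f, x)
    return sf_g(t)

# Example survival functions (assumed for illustration):
# f = Beta(1,2) has F-bar(t) = (1 - t)^2; g = Beta(2,1) has G-bar(t) = 1 - t^2.
sf_f = lambda t: (1.0 - t) ** 2
sf_g = lambda t: 1.0 - t ** 2

y_quarter = roc(sf_f, sf_g, 0.25)   # threshold t = 0.5, so y = 1 - 0.25
```

For this pair the curve has the closed form y = 2*sqrt(x) - x, which gives a direct check on the numerical inversion.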

Comments 


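If $f(s;u)$ and $g(s;v)$ are modeled by beta probability densities, then $u$ and $v$ are
two-element vectors, and

$$f(s;u) = \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}, \quad 0 < s < 1, \tag{A.10}$$

$$g(s;v) = \frac{s^{\tilde v_1 - 1}(1 - s)^{\tilde v_2 - 1}}{\Gamma(\tilde v_1)\Gamma(\tilde v_2)/\Gamma(\tilde v_1 + \tilde v_2)}, \quad 0 < s < 1, \tag{A.11}$$

where $\tilde u$ and $\tilde v$ are related to $u$ and $v$ by

$$\tilde u_1 = u_1\left[\frac{u_1(1 - u_1)}{u_2} - 1\right], \tag{A.12}$$

$$\tilde u_2 = \tilde u_1\left[\frac{1}{u_1} - 1\right], \tag{A.13}$$

$$\tilde v_1 = v_1\left[\frac{v_1(1 - v_1)}{v_2} - 1\right], \tag{A.14}$$

$$\tilde v_2 = \tilde v_1\left[\frac{1}{v_1} - 1\right], \tag{A.15}$$

and where

$$\Gamma(a) = \int_0^\infty e^{-t}t^{a-1}\,dt, \quad a > 0. \tag{A.16}$$

Thus, $x = \bar F(t;u) = \int_t^\infty f(s;u)\,ds$ may be expressed

$$x = \int_t^1 \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds, \tag{A.17}$$

where $t$ is a selected threshold and $0 < t < 1$. Figure A.1 shows the relation of $\tilde u_1$ and $\tilde u_2$ to
$\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)$.

Figure A.1 Here $\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)$ is shown as a function of $\tilde u_1$ and $\tilde u_2$.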


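Evaluation using Weierstrass' product [Korn and Korn, 2000, pp. 822] shows that
$1/\Gamma(z)$ may be factored into the infinite product

$$\frac{1}{\Gamma(z)} = z e^{Cz} \prod_{k=1}^{\infty}\left(1 + \frac{z}{k}\right)e^{-z/k}, \tag{A.18}$$

where $C \approx 0.5772157$ is the Euler-Mascheroni constant and

$$\frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} = \frac{a + b}{ab}\prod_{k=1}^{\infty}\frac{(k + a + b)(k)}{(k + a)(k + b)}. \tag{A.19}$$

Note (see [Patel et al., 1976]) that

$$\int_0^t \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds = \frac{B_t(\tilde u_1,\tilde u_2)}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)} = I_t(\tilde u_1,\tilde u_2), \tag{A.20}$$

where

$$B_t(\tilde u_1,\tilde u_2) = \int_0^t s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}\,ds \tag{A.21}$$

and $I_t$ is the incomplete beta function ratio. Also note that

$$\int_t^1 \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds = 1 - \int_0^t \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds, \tag{A.22}$$

so that

$$\int_t^1 \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds = 1 - I_t(\tilde u_1,\tilde u_2). \tag{A.23}$$

For the incomplete beta function ratio [Patel et al., 1976, pp. 246]

$$I_t(\tilde u_1,\tilde u_2) = 1 - I_{1-t}(\tilde u_2,\tilde u_1), \tag{A.24}$$

so that

$$\int_t^1 \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds = 1 - (1 - I_{1-t}(\tilde u_2,\tilde u_1)). \tag{A.25}$$

Therefore

$$\int_t^1 \frac{s^{\tilde u_1 - 1}(1 - s)^{\tilde u_2 - 1}}{\Gamma(\tilde u_1)\Gamma(\tilde u_2)/\Gamma(\tilde u_1 + \tilde u_2)}\,ds = I_{1-t}(\tilde u_2,\tilde u_1). \tag{A.26}$$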
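
The identity that the upper-tail integral of a beta density from t to 1 equals the incomplete beta function ratio evaluated at 1 - t with the shape parameters swapped can be checked by quadrature. The sketch below is an illustration under stated assumptions: the shape values 3 and 5, the threshold 0.4, and the Simpson integrator are all chosen here for the example and are not taken from the text.

```python
import math

def beta_pdf(s, a, b):
    """Beta density with shape parameters a and b (both >= 1 in this sketch)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return s ** (a - 1) * (1 - s) ** (b - 1) / norm

def simpson(func, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(lo + k * h)
    return total * h / 3

def beta_ratio(t, a, b):
    """Incomplete beta function ratio I_t(a, b), evaluated by quadrature."""
    return simpson(lambda s: beta_pdf(s, a, b), 0.0, t)

a, b, t = 3.0, 5.0, 0.4
survival = simpson(lambda s: beta_pdf(s, a, b), t, 1.0)  # integral from t to 1
mirrored = beta_ratio(1.0 - t, b, a)                     # I_{1-t} with shapes swapped
```

A library routine for the regularized incomplete beta function would normally replace the quadrature; it is written out here only to keep the example self-contained.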

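From above, and noting that Equations (A.12)-(A.14) may be manipulated to solve for
$u_1$, $u_2$, $v_1$, and $v_2$ using $u_1 = \frac{\tilde u_1}{\tilde u_1 + \tilde u_2}$, $u_2 = \frac{\tilde u_1 \tilde u_2}{(\tilde u_1 + \tilde u_2 + 1)(\tilde u_1 + \tilde u_2)^2}$, $v_1 = \frac{\tilde v_1}{\tilde v_1 + \tilde v_2}$, and
$v_2 = \frac{\tilde v_1 \tilde v_2}{(\tilde v_1 + \tilde v_2 + 1)(\tilde v_1 + \tilde v_2)^2}$,

$$x = I_{1-t}(\tilde u_2,\tilde u_1) = \bar F\left(t;\ \frac{\tilde u_1}{\tilde u_1 + \tilde u_2},\ \frac{\tilde u_1 \tilde u_2}{(\tilde u_1 + \tilde u_2 + 1)(\tilde u_1 + \tilde u_2)^2}\right) \tag{A.27}$$

and similarly

$$y = I_{1-t}(\tilde v_2,\tilde v_1) = \bar G\left(t;\ \frac{\tilde v_1}{\tilde v_1 + \tilde v_2},\ \frac{\tilde v_1 \tilde v_2}{(\tilde v_1 + \tilde v_2 + 1)(\tilde v_1 + \tilde v_2)^2}\right). \tag{A.28}$$

Therefore, for a beta density model and for given values of $u$ and $v$,

$$y = r(x;w) = \bar G \circ \bar F^{-1}(x;w) = I_{1-t}(\tilde v_2,\tilde v_1) \quad \text{and} \quad x = \bar F(t;u) = I_{1-t}(\tilde u_2,\tilde u_1). \tag{A.29}$$

Thus, whereas there are various ways to describe $r$ (such as an infinite series of products,
gamma functions, and the incomplete beta function ratio), such expressions are
impractical to further evaluate analytically. Even if they were practical to evaluate, the
analytical expressions would be for the ROC curve, not for the ROC curve density.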
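
The conversion between the two parameterizations of a beta density used above can be sketched as a pair of functions. This is a hedged illustration: it assumes the second element of each parameter vector enters as the expression a*b / ((a + b + 1)(a + b)^2), which is what the arguments of Equations (A.27) and (A.28) show, and the function names are hypothetical.

```python
def shapes_from_moments(mean, second):
    """Map (mean, second parameter) to the two beta shape parameters.

    Assumes `second` equals a*b / ((a + b + 1) * (a + b)**2), as in the
    arguments of Equations (A.27) and (A.28).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < second < mean * (1.0 - mean):
        raise ValueError("second parameter outside the admissible range")
    common = mean * (1.0 - mean) / second - 1.0
    return mean * common, (1.0 - mean) * common

def moments_from_shapes(a, b):
    """Inverse map: shape parameters back to (mean, second parameter)."""
    total = a + b
    return a / total, a * b / ((total + 1.0) * total ** 2)

shape_a, shape_b = shapes_from_moments(0.3, 0.01)      # expected near (6, 14)
mean_back, second_back = moments_from_shapes(shape_a, shape_b)
```

The range check reflects the fact that a beta density with positive shape parameters exists only when the second parameter is below mean * (1 - mean).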
A.2 Derivation of ROC curve density


Theorem 3.2 ROC curve density

Let $d = \{s_i : i = 1, \ldots, I\}$ be a set of independent and identically distributed samples $s_i$
from distribution $f$. Let $h = \{q_j : j = 1, \ldots, J\}$ be a set of independent and identically
distributed samples $q_j$ from distribution $g$, and let $p_u(u)$ and $p_v(v)$ be prior densities of
the random parameter vectors $u$ and $v$. Let $\mathcal{A}$ be the admissible set of $u$ and $v$
parameters. Then

$$p_{y|x}(y|x,d,h) = C_0 \int_{\mathcal{A}} p_{y|x}(y|x,u,v) \prod_i f(s_i;u) \prod_j g(q_j;v)\, p_u(u)\, p_v(v)\,du\,dv, \tag{A.30}$$

where the constant $C_0$ depends on $d$ and $h$.

Proof

Let $w$ be the concatenation of $u$ and $v$ (i.e., $w = [u_1\ u_2\ \ldots\ v_1\ v_2\ \ldots]$), and let $D$ be the
concatenation of $d$ and $h$.

By marginalization

$$p_{y|x}(y|x,D) = \int_{\mathcal{A}} p_{y|x}(y|x,w)\, p_{w|D}(w|D)\,dw, \tag{A.31}$$

and by Bayes' rule,

$$p_{w|D}(w|D) = C_1\, p_{D|w}(D|w)\, p_w(w), \tag{A.32}$$

where the constant $C_1$ depends on $D$.

Thus, by independence

$$p_{D|w}(D|w) = C_2 \prod_{i=1}^{I} f(s_i;u) \prod_{j=1}^{J} g(q_j;v), \tag{A.33}$$

where the constant $C_2$ depends on $d$ and $h$,

and

$$p_w(w) = p_u(u)\, p_v(v). \tag{A.34}$$

Combining Equations (A.32), (A.33), and (A.34) shows that Equation (A.31) is
equivalent to

$$p_{y|x}(y|x,D) = C_0 \int_{\mathcal{A}} p_{y|x}(y|x,u,v) \prod_i f(s_i;u) \prod_j g(q_j;v)\, p_u(u)\, p_v(v)\,du\,dv, \tag{A.35}$$

where the constant $C_0$ depends on $d$ and $h$.

Note that $\mathcal{A}$ is used here rather than $A$ because notation earlier in this document (see
Equation (3.3)) refers to the admissible set for the beta density model as $A$, and this proof
is not restricted to the beta density model.

Comments 

For  a  beta  density, 

Pyx(y\x,  D)  =  Co  f™  /“  /r  ir  Pyx(y\x,  U1,U2,  Thv2)  JJ  /(s*;  uiu2)  ]I  g(qj-,v iv2) 

i  j 

■PuiA;,  (fr+fr+ijfr+S^ )du1du2(lvi(E2l 

and  pyx(y\x,  V1.V2)  =  S(y  -  r(x;w)), 
where


r(x;v2vi,u2ui)  =  G  o  F~x(x;  v2vx,u2ui)  =  F(t;u)  =  h-t(u2ui) 

(A. 36) 

and  u  and  v  are  related  to  u  and  v  by  Equations  (A.12)-(A.14).  Also, 


r(S1)r(g2) 

r(«1+u2) 


r(vi)r(v2) 

r(t>i+«2) 


Y[f(si\u1}u2)  =  JJ 

i  i 

Yi9(Qj\vi,v2) = n 

3  3 

One example choice of parameter density prior has $p_u(u_1,u_2)$ equal to a constant over all
values of $u_1$ and $u_2$ for which the beta density is defined, where $u_1$ is mean and $u_2$ is
standard deviation. With an identical choice of priors for $p_v(v_1,v_2)$, the following bounds
apply:


Pn(ui,u2)  =  1,  0  <  Ul  <  0.5,  and  u2  <  Ml(ttl+2)^1+1)2 
Pu{ui  U2)  =  1,  0.5  <  Ul  <  1,  and  u2  < 

Pu(p,i,u2)  =  0,0<Ul<  0.5,  and  u2  >  Ui(mi_^i+1)2 
Pu(u  iu2)  =  0,  0.5  <  «1  <  1,  and  u2  >  Ml(21J"il)2 
Pv(vhv2)  =  1,  0  <  Vl  <  0.5,  and  v2  <  t>l(t>1+2)(«1+1)a 
Pv(v iv2)  =  1,0.5  <  vi  <  1,  and  v2  <  M^)2 
Pv(viV2)  =  0,  0  <  Vl  <  0.5,  and  v2  >  Mvil^l1+1)2 
Pv{v i,v2)  =  0,0.5  <  vi  <  1,  and  v2  > 

Even for the case of uniform prior density over admissible mean and standard deviations
and with single beta densities (simple in comparison with beta mixture models), there are
no fewer than six incomplete gamma functions inside the integral. Without considering the
definition for $p_{y|x}(y|x,D)$, since the gamma function is itself analytically described by an
integral, an analytical solution, even for a single beta density, is not feasible (multiple
analytic terms inside the four-part integral would consist of
$\Gamma(a) = \int_0^\infty e^{-t}t^{a-1}\,dt$, $a > 0$). However, using Monte Carlo methods, a convergent
numerical result may be obtained. Further, rather than the restrictive solution that an
analytical  development  would  produce  (restricted  to  single  beta  models),  the  numerical 
development  may  be  extended  to  beta  mixture  models  or  other  families  of  density 
models.  Thus,  based  on  the  analytic  framework  it  is  clear  that  a  numerical  evaluation  is 
needed.  The  evaluation  points  of  Figure  3.6,  shown  within  the  allowed  standard 
deviation  versus  mean  plots,  are  sampling  points  used  to  estimate  the  full  Bayesian 
posterior,  which  may  be  visualized  as  a  three-dimensional  density.  The  oval  regions  of 
the  two  left  plots  of  this  figure,  shown  in  the  vicinity  of  the  target  and  non-target  mean 
and  standard  deviation,  indicate  confidence  interval  bounds  for  the  posterior  probability. 
Similarly,  the  darkened  regions  of  Figure  5.12  indicate  10%,  30%,  50%,  and  90% 
confidence  interval  bounds  for  the  posterior  probability. 
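
A minimal sketch of evaluating a posterior on a grid of parameter points follows. It is an illustration under stated assumptions, not the method of Appendix C: it uses a single beta density, a flat prior over a rectangular grid of mean and variance values restricted to points where the beta density exists, and simulated samples; all names and grid spacings are hypothetical.

```python
import math
import random

def log_beta_pdf(s, a, b):
    """Log of the beta density, using lgamma for numerical stability."""
    return ((a - 1) * math.log(s) + (b - 1) * math.log(1 - s)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def posterior_weights(samples, means, variances):
    """Normalized posterior weights on a (mean, variance) grid, flat prior."""
    log_post = {}
    for m in means:
        for v in variances:
            if v >= m * (1 - m):        # no beta density exists at this point
                continue
            common = m * (1 - m) / v - 1
            a, b = m * common, (1 - m) * common
            log_post[(m, v)] = sum(log_beta_pdf(s, a, b) for s in samples)
    peak = max(log_post.values())       # subtract the peak before exponentiating
    unnorm = {w: math.exp(lp - peak) for w, lp in log_post.items()}
    total = sum(unnorm.values())
    return {w: p / total for w, p in unnorm.items()}

random.seed(1)
samples = [random.betavariate(6, 14) for _ in range(40)]   # true mean 0.3
means = [k / 20 for k in range(1, 20)]
variances = [k / 400 for k in range(1, 40)]
weights = posterior_weights(samples, means, variances)
best = max(weights, key=weights.get)    # grid point with the largest weight
```

Each weighted grid point for the target density, paired with one for the non-target density, yields one representative ROC curve; the weights play the role of the posterior parameter density in the integral of Theorem 3.2.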
Appendix B. Analytical Derivation of Roughness for Cardinal Interpolation

B.1 Introduction and background on cardinal interpolation

Gustafson,  Parker,  and  Martin  [Gustafson  et  al.,  2006]  apply  Bayesian  methods  to  find 
the  probability  density  of  certain  interpolating  functions,  where  this  density  has  desirable 
extrapolation  properties  that  define  cardinal  interpolation.  In  this  appendix,  cardinal 
interpolation  and  roughness  are  introduced  and  then  an  analytical  extension  to  Gustafson, 
Parker,  and  Martin  [Gustafson  et  al.,  2006]  is  provided.  As  described  in  future  work 
(Section  6.2),  incorporating  roughness  into  a  target  or  non-target  density  model  can 
provide  a  means  to  characterize  and  control  models  of  various  complexity  for 
performance  metric  uncertainty. 

Development  of  the  cardinal  interpolation  density  provided  an  early  example  for  the 
development  of  densities  for  ROC  and  CEG  curves  that  is  the  key  advance  reported  here. 
Calculation  of  the  cardinal  interpolation  density  is  facilitated  by  an  analytical  derivation 
of  roughness  of  a  sum  of  Gaussian  functions,  where  roughness  is  defined  as  integrated 
squared second derivative of the sum of the functions. Roughness as used here corresponds
to the degree of smoothness in Bishop (see [Bishop, 1995, pp. 173]). See [Bishop, 1995] and
[MacKay, 1992a, 1992b] for the related discussion of regularization.

The  following  summarizes  the  cardinal  interpolation  concept  and  its  use  of  the  analytical 
derivation  of  roughness.  Gustafson,  Parker,  and  Martin  [Gustafson  et  al.,  2006]  provide 
a  full  description. 

The  cardinal  interpolation  density  combines  a  linear  model  with  a  Gaussian  radial  basis 
function  model.  When  estimating  points  that  are  far  from  observed  data  points,  an 
appropriate  model  is  assumed  to  be  a  least  squares  line;  when  estimating  points  that  are 
close to observed data points, an appropriate model is assumed to be an interpolator (in
this case a Gaussian radial basis function interpolator). Let data points $D =
(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ with $x_1 < x_2 < \ldots < x_n$ be samples from a Gaussian
probability density in $y$ relative to a line (see [Bishop, 1995]). By marginalization, the
probability density $p(y|x,D)$ of $y$ given $x$ and $D$ for a linear model is
$\int p(y|x,a,b)\,p(a,b|D)\,da\,db$, where $a$ is the intercept and $b$ is the slope of the line. By
Bayes' rule, $p(a,b|D)$ is proportional to $p(D|a,b)\,p(a)\,p(b)$ for independent $a$ and $b$,
where $p(D|a,b)$ is the product of $p(y|x,a,b)$ evaluated at each of the data points and is thus
proportional to the deviation weight $\exp[-\sum_i (y_i - a - bx_i)^2/(2\sigma^2)]$. The result is a
density for the linear model (see [Bishop, 1995]) that has a mean which is the least
squares line at $D$.

The cardinal interpolation density uses the above linear model with a Gaussian radial
basis interpolating model. The combined model is

$y(x;a,b,c) = a + bx + \sum_i A_i \exp[-(x - x_i)^2/(2c^2)]$, where each basis function has its
mean at a data point $x_i$, has variance $c^2$, and has amplitude $A_i$ such that $y_i = y(x_i;a,b,c)$
so that the points are interpolated. Regularization (see [Bishop, 1995]) yields weighting
that depends on roughness $r(a,b,c)$. The cardinal interpolation density is developed by
requiring that the roughness weight $\exp(-Kr(a,b,c))$ equal the above deviation weight,
where $K$ is such that both types of weights have the same minimum.
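
For fixed a, b, and c, the interpolation condition on the amplitudes is a linear system: the Gaussian kernel matrix times the amplitude vector must equal the residuals of the data about the line. The sketch below illustrates this with a small pure-Python solver; the data values, the parameter values, and the function names are assumptions made for the example.

```python
import math

def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * sol[k] for k in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def amplitudes(xs, ys, a, b, c):
    """Amplitudes A_i that make the combined model pass through every point."""
    kernel = [[math.exp(-(xi - xj) ** 2 / (2 * c * c)) for xj in xs] for xi in xs]
    resid = [y - a - b * x for x, y in zip(xs, ys)]
    return solve(kernel, resid)

def model(x, xs, amps, a, b, c):
    """Combined model: line plus Gaussian radial basis functions."""
    return a + b * x + sum(A * math.exp(-(x - xi) ** 2 / (2 * c * c))
                           for A, xi in zip(amps, xs))

xs, ys = [0.0, 1.0, 2.5, 4.0], [1.0, 3.0, 2.0, 5.0]   # hypothetical data
a, b, c = 0.5, 1.0, 0.8                               # hypothetical parameters
amps = amplitudes(xs, ys, a, b, c)
```

Far from the data the basis functions decay to zero and the model reverts to the line a + bx, which is the extrapolation behavior described above.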

B.2  Analytical  roughness  expression 

The following expression for roughness has been verified for many sums of Gaussian
functions using numerical integration. The use of this expression can greatly reduce the
number of required computations as compared with numerical integration.
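Theorem


Let $a, b \in \mathbb{R}$ and $c > 0$ and $x_1, x_2, \ldots, x_n \in \mathbb{R}$, $A_1, A_2, \ldots, A_n \in \mathbb{R}$, and

$$y(x;a,b,c) = a + bx + \sum_{i=1}^{n} A_i e^{-\frac{(x - x_i)^2}{2c^2}} \tag{B.1}$$

for $x \in \mathbb{R}$.

Then roughness, $r(a,b,c)$, is

$$\int_{-\infty}^{\infty} (y''(x;a,b,c))^2\,dx = \frac{\sqrt{\pi}}{c^3}\sum_{i=1}^{n}\sum_{j=1}^{n} A_i A_j e^{\gamma}\left[\frac{3}{4} + 3\gamma + \gamma^2\right], \tag{B.2}$$

where

$$\gamma = \frac{-(x_i - x_j)^2}{4c^2}. \tag{B.3}$$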
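
The kind of numerical verification mentioned in Section B.2 can be sketched as follows. This is an illustrative check, not the author's code: it assumes the closed form is sqrt(pi)/c^3 times the double sum of A_i A_j e^gamma (3/4 + 3 gamma + gamma^2) with gamma = -(x_i - x_j)^2 / (4c^2), and it compares that value with a Simpson-rule integral of the squared second derivative. The centers, amplitudes, and width are hypothetical.

```python
import math

def roughness_closed_form(xs, amps, c):
    """Closed-form roughness of a sum of Gaussians with common width c."""
    total = 0.0
    for xi, ai in zip(xs, amps):
        for xj, aj in zip(xs, amps):
            g = -(xi - xj) ** 2 / (4 * c * c)
            total += ai * aj * math.exp(g) * (0.75 + 3 * g + g * g)
    return math.sqrt(math.pi) / c ** 3 * total

def second_derivative(x, xs, amps, c):
    """y''(x); the linear part a + b*x contributes nothing."""
    return sum(A * math.exp(-(x - xi) ** 2 / (2 * c * c))
               * ((x - xi) ** 2 / c ** 4 - 1 / c ** 2)
               for A, xi in zip(amps, xs))

def roughness_numeric(xs, amps, c, n=20000):
    """Simpson-rule integral of (y'')^2 over a range where it is non-negligible."""
    lo, hi = min(xs) - 12 * c, max(xs) + 12 * c
    h = (hi - lo) / n
    total = second_derivative(lo, xs, amps, c) ** 2
    total += second_derivative(hi, xs, amps, c) ** 2
    for k in range(1, n):
        weight = 4 if k % 2 else 2
        total += weight * second_derivative(lo + k * h, xs, amps, c) ** 2
    return total * h / 3

xs, amps, c = [0.0, 1.0, 2.5], [1.0, -0.5, 2.0], 0.7   # hypothetical values
exact = roughness_closed_form(xs, amps, c)
approx = roughness_numeric(xs, amps, c)
```

The closed form costs n^2 exponential evaluations, whereas the quadrature costs n evaluations at each of many thousands of grid points, which is the computational saving the text refers to.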


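Proof


Note that

$$\frac{\partial}{\partial x}\left[A_i e^{-\frac{(x - x_i)^2}{2c^2}}\right] = -\frac{A_i}{c^2}\, e^{-\frac{(x - x_i)^2}{2c^2}}(x - x_i)$$

and

$$\frac{\partial^2}{\partial x^2}\left[A_i e^{-\frac{(x - x_i)^2}{2c^2}}\right]\frac{\partial^2}{\partial x^2}\left[A_j e^{-\frac{(x - x_j)^2}{2c^2}}\right] = \frac{A_i A_j}{c^4}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}} - \frac{A_i A_j}{c^6}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}}(x - x_i)^2$$
$$\qquad - \frac{A_i A_j}{c^6}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}}(x - x_j)^2 + \frac{A_i A_j}{c^8}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}}(x - x_i)^2(x - x_j)^2. \tag{B.4}$$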


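Then roughness for $n$ points is

$$r(a,b,c) = \int_{-\infty}^{\infty}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big(\frac{A_i A_j}{c^4}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}} - \frac{A_i A_j}{c^6}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}}(x - x_i)^2$$
$$\qquad - \frac{A_i A_j}{c^6}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}}(x - x_j)^2 + \frac{A_i A_j}{c^8}\, e^{-\frac{(x - x_i)^2}{2c^2}} e^{-\frac{(x - x_j)^2}{2c^2}}(x - x_i)^2(x - x_j)^2\Big)\,dx. \tag{B.5}$$

Note that the terms may be separately integrated, and that three general forms appear in
Equation (B.5).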

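First general form

The first general form, with $s = x_i$ and $t = x_j$, is $\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}\,dx$, and

$$\frac{A_i A_j}{c^4}\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}\,dx = \frac{A_i A_j}{c^4}\int_{-\infty}^{\infty} e^{lx^2 + wx + k}\,dx, \tag{B.6}$$

where $l = -\frac{1}{2c^2}(2)$, $w = -\frac{1}{2c^2}(-2s - 2t)$, and $k = -\frac{1}{2c^2}(s^2 + t^2)$.

Substitute $p = 1/c^2$, $q = -w/2l = \frac{s + t}{2}$, and $v = k + pq^2$ into Equation (B.6):

$$\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}\,dx = \int_{-\infty}^{\infty} e^{lx^2 + wx + k}\,dx = e^{v}\int_{-\infty}^{\infty} e^{-p(x - q)^2}\,dx. \tag{B.7}$$

Note that applying the definition of a Gaussian probability density, $\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx = 1$,
with $\sigma^2 = c^2/2$ yields the following:

$$e^{v}\int_{-\infty}^{\infty} e^{-p(x - q)^2}\,dx = e^{v}\sqrt{\pi}\,c. \tag{B.8}$$

Thus

$$\frac{A_i A_j}{c^4}\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}\,dx = \frac{A_i A_j}{c^4}\, e^{v}\sqrt{\pi}\,c, \tag{B.9}$$

where $v = k + pq^2$ and $k = -\frac{1}{2c^2}(s^2 + t^2)$, $p = 1/c^2$, and $q = \frac{s + t}{2}$.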


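Second general form

$$\frac{A_i A_j}{c^6}\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}(x - t)^2\,dx \tag{B.10}$$

Similar to Equation (B.6), let

$$\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}(x - t)^2\,dx = \int_{-\infty}^{\infty} e^{lx^2 + wx + k}(x - t)^2\,dx. \tag{B.11}$$

Then, similar to Equation (B.7), $\int_{-\infty}^{\infty} e^{lx^2 + wx + k}(x - t)^2\,dx = e^{v}\int_{-\infty}^{\infty} e^{-p(x - q)^2}(x - t)^2\,dx$.

Note that $e^{v}\int_{-\infty}^{\infty} e^{-p(x - q)^2}(x - t)^2\,dx =$

$t^2 e^{v}\int_{-\infty}^{\infty} e^{-p(x - q)^2}\,dx - 2te^{v}\int_{-\infty}^{\infty} xe^{-p(x - q)^2}\,dx + e^{v}\int_{-\infty}^{\infty} x^2 e^{-p(x - q)^2}\,dx$.

Similar to Equation (B.8), $t^2 e^{v}\int_{-\infty}^{\infty} e^{-p(x - q)^2}\,dx = t^2 e^{v}\sqrt{\pi}\,c$.

Substitute $z = x - q$. Then $x = z + q$, and

$\int_{-\infty}^{\infty} xe^{-p(x - q)^2}\,dx = \int_{-\infty}^{\infty}(z + q)e^{-pz^2}\,dz = \int_{-\infty}^{\infty} ze^{-pz^2}\,dz + q\int_{-\infty}^{\infty} e^{-pz^2}\,dz$.

Note that $\int_{-\infty}^{\infty} ze^{-pz^2}\,dz = 0$.

(Recall that $p = 1/c^2$.)

Then $-2te^{v}\int_{-\infty}^{\infty} xe^{-p(x - q)^2}\,dx = -2te^{v}(0 + q\sqrt{\pi}\,c) = -2te^{v}q\sqrt{\pi}\,c$.

Similar to above, let $z = x - q$, and $\int_{-\infty}^{\infty} x^2 e^{-p(x - q)^2}\,dx = \int_{-\infty}^{\infty}(z + q)^2 e^{-pz^2}\,dz$.

Then

$\int_{-\infty}^{\infty}(z + q)^2 e^{-pz^2}\,dz = \int_{-\infty}^{\infty}(z^2 + 2qz + q^2)e^{-pz^2}\,dz =$

$\int_{-\infty}^{\infty} z^2 e^{-pz^2}\,dz + \int_{-\infty}^{\infty} 2qze^{-pz^2}\,dz + \int_{-\infty}^{\infty} q^2 e^{-pz^2}\,dz$.

Note that $\int_{-\infty}^{\infty} 2qze^{-pz^2}\,dz = 0$, and similar to Equation (B.8),

$\int_{-\infty}^{\infty} q^2 e^{-pz^2}\,dz = q^2\sqrt{\pi}\,c$, and

$\int_{-\infty}^{\infty} z^2 e^{-pz^2}\,dz = \frac{c^2}{2}\sqrt{\pi}\,c$.

Thus

$\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}(x - t)^2\,dx = t^2 e^{v}\sqrt{\pi}\,c - 2te^{v}q\sqrt{\pi}\,c + e^{v}\frac{c^2}{2}\sqrt{\pi}\,c + e^{v}q^2\sqrt{\pi}\,c$.

Note that factoring:

$t^2 e^{v}\sqrt{\pi}\,c - 2te^{v}q\sqrt{\pi}\,c + e^{v}\frac{c^2}{2}\sqrt{\pi}\,c + e^{v}q^2\sqrt{\pi}\,c = e^{v}\sqrt{\pi}\,c\left[t^2 - 2tq + \frac{c^2}{2} + q^2\right]$.

Thus,

$$\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}(x - t)^2\,dx = e^{v}\sqrt{\pi}\,c\left[t^2 - 2tq + \frac{c^2}{2} + q^2\right]. \tag{B.12}$$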


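Third general form
A similar progression yields

$$\int_{-\infty}^{\infty} e^{-\frac{(x - s)^2}{2c^2}} e^{-\frac{(x - t)^2}{2c^2}}(x - s)^2(x - t)^2\,dx = ce^{v}\sqrt{\pi}\Big\{q^4 + c^2(3q^2) + \frac{3c^4}{4} + \Big[(-2s - 2t)\Big(q^3 + \frac{3qc^2}{2}\Big)\Big]$$
$$\qquad + \Big[(s^2 + t^2 + 4st)\Big(q^2 + \frac{c^2}{2}\Big)\Big] + [(q)(-2st^2 - 2s^2t)] + [s^2t^2]\Big\}. \tag{B.13}$$

Applying Equations (B.9), (B.12), and (B.13) to the roughness formula:

$$r(a,b,c) = \frac{1}{c^3}\sum_{i=1}^{n}\sum_{j=1}^{n} A_i A_j e^{v}\Big\{1 - \frac{t^2 - 2tq + \frac{c^2}{2} + q^2}{c^2} - \frac{s^2 - 2sq + \frac{c^2}{2} + q^2}{c^2}$$
$$\qquad + \frac{q^4}{c^4} + \frac{3q^2}{c^2} + \frac{3}{4} + \frac{-2sq^3}{c^4} + \frac{-3qs}{c^2} + \frac{-2tq^3}{c^4} + \frac{-3qt}{c^2} + \frac{s^2q^2}{c^4} + \frac{s^2}{2c^2} + \frac{t^2q^2}{c^4} + \frac{t^2}{2c^2}$$
$$\qquad + \frac{4stq^2}{c^4} + \frac{2st}{c^2} + \frac{-2st^2q}{c^4} + \frac{-2s^2tq}{c^4} + \frac{s^2t^2}{c^4}\Big\}\pi^{1/2}. \tag{B.14}$$

Note that the following terms

$$\frac{q^4}{c^4} + \frac{-2sq^3}{c^4} + \frac{-2tq^3}{c^4} + \frac{s^2q^2}{c^4} + \frac{t^2q^2}{c^4} + \frac{4stq^2}{c^4} + \frac{-2st^2q}{c^4} + \frac{-2s^2tq}{c^4} + \frac{s^2t^2}{c^4} \tag{B.15}$$

factor as follows.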


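The second and third terms of Equation (B.15) are

$$\frac{-2sq^3}{c^4} + \frac{-2tq^3}{c^4} = \frac{-2q^3(s + t)}{c^4} = \frac{-4(s + t)q^3}{2c^4} = \frac{-4q^4}{c^4}, \tag{B.16}$$

and thus the first, second, and third terms of Equation (B.15) are

$$\frac{q^4}{c^4} + \frac{-2sq^3}{c^4} + \frac{-2tq^3}{c^4} = \frac{-3q^4}{c^4}. \tag{B.17}$$

Note that the seventh and eighth terms of Equation (B.15) are

$$\frac{-2st^2q}{c^4} + \frac{-2s^2tq}{c^4} = \frac{-2st(s + t)q}{c^4} = \frac{-4st(s + t)q}{2c^4} = \frac{-4stq^2}{c^4}, \tag{B.18}$$

and thus the sixth, seventh, and eighth terms in Equation (B.15) are:

$$\frac{4stq^2}{c^4} + \frac{-4stq^2}{c^4} = 0. \tag{B.19}$$

Note that Equation (B.15) is equal to

$$c^4\ \text{terms} = \frac{-3q^4}{c^4} + \frac{s^2q^2}{c^4} + \frac{t^2q^2}{c^4} + \frac{s^2t^2}{c^4}. \tag{B.20}$$

Note that the second and third terms in Equation (B.20) simplify to:

$$\frac{s^2q^2}{c^4} + \frac{t^2q^2}{c^4} = \frac{q^2(s^2 + t^2)}{c^4} = \frac{q^2[(s + t)^2 - 2st]}{c^4} = \frac{4q^2[(s + t)^2 - 2st]}{4c^4} \tag{B.21}$$

and that

$$\frac{4q^2[(s + t)^2 - 2st]}{4c^4} = \frac{4q^2(s + t)^2}{4c^4} + \frac{4q^2(-2st)}{4c^4}. \tag{B.22}$$

Note that

$$\frac{4q^2(s + t)^2}{4c^4} + \frac{4q^2(-2st)}{4c^4} = \frac{4q^4}{c^4} + \frac{4q^2(-2st)}{4c^4}. \tag{B.23}$$

Therefore, Equation (B.20) simplifies to:

$$c^4\ \text{terms} = \frac{-3q^4}{c^4} + \frac{4q^4}{c^4} + \frac{4q^2(-2st)}{4c^4} + \frac{s^2t^2}{c^4}. \tag{B.24}$$

From above equation:

$$c^4\ \text{terms} = \frac{q^4}{c^4} + \frac{-2q^2st}{c^4} + \frac{s^2t^2}{c^4}. \tag{B.25}$$

Substitute $v = st$ (from this point on $v$ denotes $st$, not the exponent $v = k + pq^2$ of Equations (B.7)-(B.9)).

$$c^4\ \text{terms} = \frac{q^4}{c^4} + \frac{-2q^2v}{c^4} + \frac{v^2}{c^4} = \frac{(q^2 - v)^2}{c^4}. \tag{B.26}$$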


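Next, examine the 11 terms with denominator of $c^2$.

$$c^2\ \text{terms} = \frac{-t^2}{c^2} + \frac{2tq}{c^2} + \frac{-2q^2}{c^2} + \frac{-s^2}{c^2} + \frac{2sq}{c^2} + \frac{3q^2}{c^2} + \frac{-3qs}{c^2} + \frac{-3qt}{c^2} + \frac{s^2}{2c^2} + \frac{t^2}{2c^2} + \frac{2st}{c^2}. \tag{B.27}$$

Note that the seventh and eighth terms in Equation (B.27) simplify to

$$\frac{-3qs}{c^2} + \frac{-3qt}{c^2} = \frac{-3q(s + t)}{c^2} = \frac{-6q(s + t)}{2c^2} = \frac{-6q^2}{c^2}. \tag{B.28}$$

Combining the third, sixth, seventh, and eighth terms in Equation (B.27):

$$\frac{-2q^2}{c^2} + \frac{3q^2}{c^2} + \frac{-6q^2}{c^2} = \frac{-5q^2}{c^2}. \tag{B.29}$$

Now, Equation (B.27) is simplified as

$$c^2\ \text{terms} = \frac{-t^2}{c^2} + \frac{2tq}{c^2} + \frac{-5q^2}{c^2} + \frac{-s^2}{c^2} + \frac{2sq}{c^2} + \frac{s^2}{2c^2} + \frac{t^2}{2c^2} + \frac{2st}{c^2}. \tag{B.30}$$

Note that the second and fifth terms in Equation (B.30) are:

$$\frac{2tq}{c^2} + \frac{2sq}{c^2} = \frac{2q(s + t)}{c^2} = \frac{4q(s + t)}{2c^2} = \frac{4q^2}{c^2}, \tag{B.31}$$

and thus Equation (B.30) simplifies to:

$$c^2\ \text{terms} = \frac{-t^2}{c^2} + \frac{-q^2}{c^2} + \frac{-s^2}{c^2} + \frac{s^2}{2c^2} + \frac{t^2}{2c^2} + \frac{2st}{c^2}. \tag{B.32}$$

Combining the first, third, fourth, and fifth terms of Equation (B.32):

$$\frac{-t^2}{c^2} + \frac{-s^2}{c^2} + \frac{s^2}{2c^2} + \frac{t^2}{2c^2} = \frac{-t^2 - s^2}{2c^2}. \tag{B.33}$$

Now, Equation (B.32) simplifies to:

$$c^2\ \text{terms} = \frac{-t^2 - s^2}{2c^2} - \frac{q^2}{c^2} + \frac{2st}{c^2}. \tag{B.34}$$

Replace $q$ by $\frac{s + t}{2}$ in the above equation, and use a common denominator to obtain:

$$c^2\ \text{terms} = \frac{-2t^2 - 2s^2}{4c^2} - \frac{(s + t)^2}{4c^2} + \frac{8st}{4c^2}. \tag{B.35}$$

Therefore,

$$c^2\ \text{terms} = \frac{-2t^2 - 2s^2 - (s^2 + 2st + t^2) + 8st}{4c^2} = \frac{-2t^2 - 2s^2 - s^2 - 2st - t^2 + 8st}{4c^2} \tag{B.36}$$

$$= \frac{-3t^2 - 3s^2 + 6st}{4c^2} \tag{B.37}$$

and

$$c^2\ \text{terms} = \frac{-3(t^2 + s^2 - 2st)}{4c^2} = \frac{-3((s + t)^2 - 4st)}{4c^2} = \frac{-3q^2}{c^2} + \frac{12v}{4c^2} = \frac{-3q^2}{c^2} + \frac{3v}{c^2} = \frac{3(v - q^2)}{c^2}. \tag{B.38}$$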


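Combining the $c^4$, $c^2$, and constant terms, and inserting into the roughness formula:

$$r(a,b,c) = \frac{1}{c^3}\sum_{i=1}^{n}\sum_{j=1}^{n} A_i A_j e^{k + pq^2}\left[\frac{3}{4} + \frac{(3)(v - q^2)}{c^2} + \frac{(v - q^2)^2}{c^4}\right]\pi^{1/2}. \tag{B.39}$$

Substitute $\gamma = \frac{v - q^2}{c^2}$.

Note that

$$\gamma = \frac{v - q^2}{c^2} = \frac{v}{c^2} - \frac{q^2}{c^2} = \frac{st}{c^2} - \frac{(s + t)^2}{4c^2} = \frac{4st - (s + t)^2}{4c^2} = \frac{4st - (s^2 + 2st + t^2)}{4c^2} \tag{B.40}$$

and

$$\gamma = \frac{4st - s^2 - 2st - t^2}{4c^2} = \frac{-(s^2 - 2st + t^2)}{4c^2} = \frac{-(s - t)^2}{4c^2}. \tag{B.41}$$

The exponent in Equation (B.39) reduces to the same quantity, since $k + pq^2 = -\frac{s^2 + t^2}{2c^2} + \frac{(s + t)^2}{4c^2} = \frac{-(s - t)^2}{4c^2} = \gamma$.

Thus,

$$r(a,b,c) = \int_{-\infty}^{\infty}(y''(x;a,b,c))^2\,dx = \frac{\sqrt{\pi}}{c^3}\sum_{i=1}^{n}\sum_{j=1}^{n} A_i A_j e^{\gamma}\left[\frac{3}{4} + 3\gamma + \gamma^2\right], \tag{B.42}$$

where

$$\gamma = \frac{-(x_i - x_j)^2}{4c^2}. \tag{B.43}$$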


Appendix  C.  ROC  Curve  and  CEG  Curve  Probability  Density  and 

Confidence  Interval  Software 


Appendix C-1 details code [Parker, 2005] that computes median estimates of ROC curves
and AUC values, with confidence intervals, for any set of target and non-target input
score samples, assuming beta target and non-target densities. Appendix C-2 is identical
in purpose, except that it assumes two-beta mixture target and non-target densities.

These appendices contain instructions for additional code that assumes target and
non-target densities with fixed, user-specified parameters. This additional code generates
many sets of representative target and non-target samples from the fixed densities, and it
provides corresponding ROC curve coverage accuracies. Appendix C-3 and C-4
describe code identical in purpose to C-1 and C-2, but for CEG curves and RSD values.
The end of Section 5 compares the beta and two-beta density approaches; the principal
approach applied in the research reported here is the single beta model. The code for
each of the Matlab files that comprise the user interface is also provided here. The
remaining Matlab files are functions that are called upon execution of the user interface.


Appendix C-1

ROC  curve  /AUC  value  Estimation  and  Confidence  Interval 

Matlab  Instructions 

Beta  Density  Target  and  Non-target  Model 

A.  Provide  a  set  of  target  samples,  non-target  samples,  and  confidence  bound  value.  Then  compute 
the  confidence  intervals  based  on  these  samples. 

1. Place the following files into a common directory.

(For  example:  c:\matlab_svl2\work\) 

beta_mean_w_a_b_r.m 

conditioned_calc_2_r.m 

find_max_variance_r.m 

getaurcvalr.m 

getdensityvalsr.m 

get_grid_points_closest_r.m 

get_grid_points_n_closest_r.m 

get_grid_points_r.m 

get_pd_pfa_matrix_10_r.m

get_pd_pfa_pairs_pdfs2_r.m 

high_low_grid_weight_r.m 

mean_variance_to_pdf_2_r.m 

pd_pfa_from_mean_std_r.m 

pfa_pd_to_hundredths_r.m 

script_for_samples_r.m

uni_pdf_for_samples_r.m

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu,  if 
the  directory  is  not  already  in  the  path. 

3. Execute the following in Matlab. An example is contained in ‘script_for_samples_r.m’.

Enter (or load) a vector of target scores into the variable ‘new_target_scores’.

Enter (or load) a vector of nontarget scores into the variable ‘new_nontarget_scores’.

Execute the following Matlab code to produce ROC curve and AUC value estimates with confidence intervals:

uni_pdf_for_samples_r(new_target_scores,new_nontarget_scores,.95);

Replace  .95  by  alternate  confidence  interval  coverage  if  desired  (e.g.  .90,  .80). 

To obtain the upper and lower confidence interval limit values for false alarm probabilities 0, .01, ..., .99, 1 (rather than an on-screen plot), execute the following, where bound_value is the desired confidence interval coverage (e.g. .95):

[ci_median, ci_upper, ci_lower, auc_median, auc_upper, auc_lower] = ...
uni_pdf_for_samples_r(new_target_scores,new_nontarget_scores,bound_value);

ci_median - ROC curve estimate

ci_upper - Upper ROC curve confidence interval contour

ci_lower - Lower ROC curve confidence interval contour

auc_median - AUC value estimate

auc_upper - Upper AUC value confidence interval estimate

auc_lower - Lower AUC value confidence interval estimate
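
As a quick cross-check on the returned estimates, the empirical ROC curve and AUC value can be computed directly from the two score vectors on the same false alarm grid (0, .01, ..., 1). The sketch below is not the Bayesian method of this research; it is a plain empirical estimate, and the example scores and the names pfa_grid, pd_emp, and auc_emp are illustrative assumptions.

```matlab
% Hypothetical example scores in [0,1]; replace with real detector output.
new_target_scores    = [0.9 0.8 0.7 0.6 0.55];
new_nontarget_scores = [0.5 0.4 0.65 0.3 0.2];

pfa_grid  = 0:0.01:1;                       % false alarm probabilities 0, .01, ..., 1
pd_emp    = zeros(size(pfa_grid));          % empirical detection probability
sorted_nt = sort(new_nontarget_scores, 'descend');
n_nt      = numel(sorted_nt);

for k = 1:numel(pfa_grid)
    % Largest number of non-target scores allowed above the threshold.
    n_fa = floor(pfa_grid(k)*n_nt + 1e-9);
    if n_fa >= n_nt
        thr = -Inf;                         % every score is declared a target
    else
        thr = sorted_nt(n_fa + 1);          % declare target when score > thr
    end
    pd_emp(k) = mean(new_target_scores > thr);
end

% Empirical AUC as the fraction of (target, non-target) pairs ranked
% correctly, counting ties as one half.
[T, N]  = meshgrid(new_target_scores, new_nontarget_scores);
auc_emp = mean(T(:) > N(:)) + 0.5*mean(T(:) == N(:));
```

If ci_median is returned as a vector over the same grid (an assumption based on the description above), plot(pfa_grid, pd_emp) can be overlaid on it for a visual comparison.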


B.  Generate  many  sets  of  samples  for  selected  underlying  target  and  nontarget  densities,  and  then 
obtain  confidence  intervals  and  estimates  for  the  ROC  /  AUC  for  each  set  of  samples  and  compute 
confidence  interval  accuracy  (e.g.  alpha)  among  all  sets.  This  process  assumes  a  single  beta  model 
for  target  and  non-target. 

1.  Place  the  following  files  into  a  common  directory. 

For example: c:\matlab_sv12\work\roc\

beta_mean_w_a_b_r.m 

conditioned_calc_2_r.m 

find_max_variance_r.m 

generic_rnd.m 

get_aurc_val_r.m

getdensityvalsr.m 

get_grid_points_closest_r.m 

get_grid_points_n_closest_r.m 

get_grid_points_r.m 

get_pd_pfa_matrix_10_r.m

get_pd_pfa_pairs_pdfs2_r.m 

high_low_grid_weight_r.m 

mean_variance_to_pdf_2_r.m 

pd_pfa_from_mean_std_r.m 

pfa_pd_to_hundredths_r.m 

runchoosesample.m 

run_Uu_aurc_95_r.m 

samplegenunitestr.m 

sample_gen_user_input_r.m 

script_ROC_AUC_CIs_with_coverage_accuracy.m 

uni_pdf_aurc_95_r.m 

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu. 

3.  Open  the  File  ‘script_ROC_AUC_CIs_with_coverage_accuracy.m’. 

Lines 7-18. Specify the number of target samples, the number of non-target samples, the mean and variance of the assumed target beta density, the mean and variance of the assumed non-target beta density, and the number of runs. Alternatively, provide any density form as input (Lines 23-24 provide an example).

Evaluate Lines 1 through 88. [Note: in Matlab this can be achieved by highlighting these lines, right-clicking to obtain a menu, and choosing ‘Evaluate Selection’.] After each run, the full set of results is saved at Line 88.

ROC  curve  for  a  single  run  with  confidence  intervals: 

a.  Form  a  plot  of  estimated  ROC  with  confidence  intervals  [with  true  ROC]  by  Evaluating  Lines 
92-105.  Note  that  run_number  on  line  92  may  be  adjusted  to  any  run  among  the  set  specified 
in  step  3  above. 

b. Obtain coverage for the full set of runs by Evaluating Lines 111-176. The mean alpha for AUC
over many runs is displayed at the top of the plot.
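
Step 3 specifies each beta density by its mean and variance. A minimal sketch of how such a pair could map to the usual beta shape parameters, and how representative samples might be drawn, is given below. It uses the standard method-of-moments relations and inverse-CDF sampling with the base Matlab function betaincinv; the variable names and example values are assumptions and are not taken from the appendix code.

```matlab
% Assumed example values for the target density (mean, variance).
target_mean = 0.7;
target_var  = 0.02;
num_target_samples = 50;

% A beta density with this mean exists only if variance < mean*(1-mean).
assert(target_var < target_mean*(1 - target_mean), 'variance too large');

% Method-of-moments conversion to shape parameters a and b.
nu = target_mean*(1 - target_mean)/target_var - 1;   % equals a + b
a  = target_mean*nu;
b  = (1 - target_mean)*nu;

% Inverse-CDF sampling: betaincinv inverts the regularized incomplete
% beta function, which is the beta cumulative distribution function.
u = rand(num_target_samples, 1);
target_samples = betaincinv(u, a, b);
```

The same conversion applies to the non-target density with its own mean and variance.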

Appendix C-2

ROC Curve / AUC Value Estimation and Confidence Interval

Matlab  Instructions 

Two-Beta  Mixture  Target  and  Non-Target  Density  Model 

A.  Provide  a  set  of  target  samples,  non-target  samples,  and  confidence  bound  value.  Then  compute 
the  confidence  intervals  based  on  these  samples  (assumes  a  two-beta  mixture  model). 

1. Place the following files into a common directory.

(For example: c:\matlab_sv12\work\)

beta_mean_w_ab_2br.m 

combine_beta_pdf_2br.m 

conditioned_calc_2_2br.m 

find_max_variance_2br.m 

get_aurc_val_2br.m 

get_pd_pfa_matrix_10_2br.m

get_pd_pfa_pairs_pdfs2_2br.m 

mixture_pdf_2br.m 

pfa_pd_to_hundredths_2br.m 

rand_two_beta_density_2br.m 

roc_from_density_2br.m

sample_gen_bimodal_2br.m 

two_beta_script_for_given_samples_2br.m 

two_beta_roc_truth_not_known_2br.m 

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu,  if 
the  directory  is  not  already  in  the  path. 

3.  Execute  the  following  in  Matlab.  An  example  is  contained  in 
‘two_beta_script_for_given_samples_2br.m’. 

Enter  (or  load)  a  vector  of  target  scores  into  the  variable  ‘new_target_scores’. 

Enter  (or  load)  a  vector  of  nontarget  scores  into  the  variable  ‘new_nontarget_scores’. 

Execute the following Matlab code to produce ROC curve and AUC value estimates with confidence intervals:

two_beta_roc_truth_not_known(new_target_scores,new_nontarget_scores,10000,.95);

Replace  10000  by  the  desired  number  of  random  draws  (lower  numbers  of  draws  decrease 
computational  time).  An  approach  is  to  begin  with  a  low  number  of  draws  and  gradually  increase 
until  convergence  of  confidence  interval  solution  is  observed. 

Replace  .95  by  alternate  confidence  interval  coverage  if  desired  (e.g.  .90,  .80). 

To obtain the upper and lower confidence interval limit values for false alarm probabilities 0, .01, ..., .99, 1 (rather than an on-screen plot), execute the following:

[ci_median, ci_upper, ci_lower, auc_median, auc_upper, auc_lower] = ...
two_beta_roc_truth_not_known(new_target_scores,new_nontarget_scores,10000,.95);

ci_median - ROC curve estimate

ci_upper - Upper ROC curve confidence interval contour

ci_lower - Lower ROC curve confidence interval contour

auc_median - AUC value estimate

auc_upper - Upper AUC value confidence interval estimate

auc_lower - Lower AUC value confidence interval estimate
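
The suggestion to start with a low number of random draws and increase it until the confidence interval stabilizes can be sketched as a simple loop. The code below is only an illustration of that stopping rule: the interval is computed from stand-in beta draws rather than from the appendix routine, and the draw counts, the tolerance, and all variable names are assumptions.

```matlab
rng(0);                                        % reproducible stand-in draws
draw_counts  = [500 1000 2000 4000 8000 16000];
tol          = 0.01;                           % assumed convergence tolerance
prev_ci      = [NaN NaN];
converged_at = NaN;

for n = draw_counts
    % Stand-in for the appendix routine: a 95% interval from n random
    % draws of a fixed beta density (replace with the real call).
    d  = sort(betaincinv(rand(n, 1), 6.65, 2.85));
    ci = [d(max(1, floor(0.025*n))), d(ceil(0.975*n))];

    % Stop once both limits move by less than tol between successive counts.
    if all(abs(ci - prev_ci) < tol)
        converged_at = n;
        break
    end
    prev_ci = ci;
end
```

If converged_at is still NaN after the loop, the limits have not settled within the tried counts and the list would need to be extended.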


B.  Generate  many  sets  of  samples  for  selected  underlying  target  and  nontarget  densities,  and  then 
obtain  confidence  intervals  and  estimates  for  the  ROC  curve  /  AUC  value  for  each  set  of  samples  and 
compute  confidence  interval  accuracy  (e.g.  alpha)  among  all  sets.  This  process  assumes  a  two-beta 
mixture  model  for  target  and  non-target. 

1.  Place  the  following  files  into  a  common  directory. 

For example: c:\matlab_sv12\work\roc\

beta_mean_w_ab_2br.m 

combine_beta_pdf_2br.m 

conditioned_calc_2_2br.m 

find_max_variance_2br.m 

get_aurc_val_2br.m 

get_pd_pfa_matrix_10_2br.m

get_pd_pfa_pairs_pdfs2_2br.m 

mixture_pdf_2br.m 

pfa_pd_to_hundredths_2br.m 

rand_two_beta_density_2br.m 

roc_from_density_2br.m

sample_gen_bimodal_2br.m 

two_beta_script_for_many_runs_2br.m 

twobeta_mn_nonempirical_2br.m 

two_beta_unipdf_aurc_1000_2br.m

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu. 

3.  Open  the  File  ‘two_beta_script_for_many_runs_2br.m’. 

Lines  4-22.  Specify  number  of  target  samples,  number  of  non-target  samples,  specify  the  five 
parameters  for  the  target  density  for  a  two-beta  mixture  model  (two  means,  two  standard 
deviations,  and  a  ratio),  the  five  parameters  for  the  non-target  density,  the  number  of 
random_draws  desired  (e.g.  2000),  confidence  interval  desired  (ci_range;  example  is  .90  for  90% 
confidence  intervals),  and  the  number  of  test  runs  (number_of_runs;  example  is  100  if  100  test 
runs  are  desired).  An  example  is  provided;  change  these  values  as  desired. 

Evaluate Lines 1 through 77. [Note: in Matlab this can be achieved by highlighting these lines, right-clicking to obtain a menu, and choosing ‘Evaluate Selection’.] After each run, the full set of results is saved at Line 88.

ROC  curve  for  a  single  run  with  confidence  intervals: 

a. Form a plot of estimated ROC with confidence intervals [with true ROC] by Evaluating Lines
78-93. Note that run_number on line 79 may be adjusted to any run among the set specified
in step 3 above.

b.  Obtain  coverage  for  the  full  set  of  runs  by  Evaluating  Lines  100-179.  The  mean  alpha  for 
AUC  over  many  runs  is  displayed  at  the  top  of  the  plot. 
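
The five parameters in step 3 (two means, two standard deviations, and a ratio) describe a two-beta mixture. The sketch below shows one way such a density could be sampled. It assumes the ratio is the mixing weight of the first component, which the appendix does not state, and all names and values are illustrative rather than taken from the appendix code.

```matlab
% Assumed five parameters of a two-beta mixture (illustrative values).
mix_means   = [0.3 0.8];     % component means
mix_stds    = [0.08 0.05];   % component standard deviations
mix_ratio   = 0.4;           % ASSUMPTION: weight of the first component
num_samples = 200;

% Method-of-moments shape parameters for each component.
nu = mix_means.*(1 - mix_means)./mix_stds.^2 - 1;
assert(all(nu > 0), 'standard deviation too large for a beta density');
a = mix_means.*nu;
b = (1 - mix_means).*nu;

% Pick a component for every sample: 1 with probability mix_ratio, else 2.
component = 1 + (rand(num_samples, 1) >= mix_ratio);

% Look up per-sample shape parameters as column vectors.
a_i = a(component);  a_i = a_i(:);
b_i = b(component);  b_i = b_i(:);

% Inverse-CDF draw from the selected beta component.
mixture_samples = betaincinv(rand(num_samples, 1), a_i, b_i);
```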


Appendix C-3

CEG/RSD  Estimation  and  Confidence  Interval  Matlab  Instructions 
Beta  Density  Target  and  Non-target  Model 

A. Provide a set of target samples and non-target samples. Then estimate the CEG curve and RSD
and associated confidence intervals based on these samples (assuming a single beta density model).

1. Place the following files into a common directory.

(For example: c:\matlab_sv12\work\ceg\)

beta_mean_c.m 

beta_mean_w_a_b_c.m 

conditioned_calc_2_c.m 

conf_error_new_w_return_c.m

conf_error_new_weighted_c.m

find_max_variance_c.m 

getdensityvalsc.m

get_grid_points_c.m 

get_grid_points_closest_c.m 

get_grid_points_n_closest_c.m 

get_pd_pfa_matrix_10_c.m 

get_pd_pfa_pairs_pdfs2_c.m 

high_low_grid_weight_c.m 

mean_variance_to_pdf_2_c.m 

script_for_samples_c.m

uni_ce_pdf_samples_c.m 

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu,  if 
the  directory  is  not  already  in  the  path. 

3. Execute the following in Matlab. An example is contained in ‘script_for_samples_c.m’.

Enter (or load) a vector of target scores into the variable ‘new_target_scores’.

Enter (or load) a vector of nontarget scores into the variable ‘new_nontarget_scores’.

Execute the following Matlab code to produce CEG curve and RSD value estimates with confidence intervals:

uni_ce_pdf_samples_c(new_target_scores,new_nontarget_scores,.95,.5);

Replace  .95  by  alternate  confidence  interval  coverage  if  desired  (e.g.  .90,  .80). 

Replace  .5  by  prior  probability  of  target. 

To obtain the upper and lower confidence interval limit values for scores 0, .01, ..., .99, 1 (rather than an on-screen plot), execute the following:

[ci_median, ci_upper, ci_lower, rsd_median, rsd_upper, rsd_lower] = ...
uni_ce_pdf_samples_c(new_target_scores,new_nontarget_scores,bound_value,prior_target);


ci_median - CEG curve estimate

ci_upper - Upper CEG curve confidence interval contour

ci_lower - Lower CEG curve confidence interval contour

rsd_median - RSD value estimate

rsd_upper - Upper RSD value confidence interval estimate

rsd_lower - Lower RSD value confidence interval estimate
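
The prior probability of target passed as the last argument matters because the probability that a given score came from a target depends on it through Bayes' rule. The sketch below illustrates only that dependence for assumed beta densities; it is not the CEG curve computation of this research, and the shape parameters, grid, and variable names are hypothetical.

```matlab
prior_target = 0.5;                 % prior probability of target

% Assumed beta shape parameters for target and non-target scores.
aT = 6.65;  bT = 2.85;
aN = 2.85;  bN = 6.65;

% Interior score grid (endpoints 0 and 1 are skipped to avoid 0/0).
s = 0.01:0.01:0.99;

% Beta probability density written with the base Matlab beta function.
beta_pdf = @(x, a, b) x.^(a - 1).*(1 - x).^(b - 1)./beta(a, b);
fT = beta_pdf(s, aT, bT);
fN = beta_pdf(s, aN, bN);

% Bayes' rule: probability of target given the observed score.
post_target = prior_target*fT ./ (prior_target*fT + (1 - prior_target)*fN);
```

Changing prior_target from .5 to a smaller value lowers post_target at every score, which is why the prior must be supplied alongside the samples.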


B.  Generate  many  sets  of  samples  for  selected  underlying  target  and  nontarget  densities,  and  then 
obtain  confidence  intervals  and  estimates  for  the  CEG  curve  /  RSD  value  for  each  set  of  samples  and 
compute  confidence  interval  accuracy  (e.g.  alpha)  among  all  sets.  This  process  assumes  a  single  beta 
model  for  target  and  non-target. 

1. Place the following files into a common directory.

For example: c:\matlab_sv12\work\roc\

beta_mean_c.m 

beta_mean_w_a_b_c.m

conditioned_calc_2_c.m 

conf_error_new_w_return_c.m

conf_error_new_weighted_c.m

find_max_variance_c.m 

getdensityvalsc.m

get_grid_points_c.m 

get_grid_points_closest_c.m 

get_grid_points_n_closest_c.m 

get_pd_pfa_matrix_10_c.m 

get_pd_pfa_pairs_pdfs2_c.m 

high_low_grid_weight_c.m 

mean_variance_to_pdf_2_c.m 

script_CEG_RSD_CIs_with_coverage_accuracy.m 

runchoosesamplecegc.m 

runcegcheckc.m 

samplegenchoosedensityc.m 

sample_gen_uni_t_extend_c.m 

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu. 

3.  Open  the  File  ‘script_CEG_RSD_CIs_with_coverage_accuracy.m’. 

Lines 8-21. Specify the number of target samples, the number of non-target samples, the mean and variance of the assumed target beta density, the mean and variance of the assumed non-target beta density, the number of runs (how many test runs are desired), and the prior probability of target. Alternatively, specify the assumed target density and non-target density by the magnitude of the density at each of 1001 evaluation points (Lines 24-27 provide an example).

Evaluate Lines 1 through 83. [Note: in Matlab this can be achieved by highlighting these lines, right-clicking to obtain a menu, and choosing ‘Evaluate Selection’.] After each run, the full set of results can be saved per Line 81.

CEG curve for a single run with confidence intervals:

a. Form a plot of the estimated CEG curve with confidence intervals [with the true CEG curve] by evaluating Lines 85-98. Note that run_number on Line 92 may be adjusted to any run among the set specified in step 3 above.

b. Obtain coverage for the full set of runs by evaluating Lines 104-182. The mean alpha for RSD over many runs is displayed at the top of the plot.
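
Step 3 allows the assumed densities to be given by their magnitude at each of 1001 evaluation points. A minimal sketch of building such a vector follows, assuming the points are equally spaced on [0, 1] and that the density should integrate to one; the bimodal shape and the variable names are hypothetical and are not taken from the script.

```matlab
% ASSUMPTION: 1001 equally spaced evaluation points on [0, 1].
score_points = linspace(0, 1, 1001);

% Hypothetical unnormalized target density shape with two bumps.
raw_shape = exp(-0.5*((score_points - 0.7)/0.10).^2) + ...
            0.5*exp(-0.5*((score_points - 0.4)/0.05).^2);

% Normalize with the trapezoidal rule so the density integrates to one.
target_density = raw_shape / trapz(score_points, raw_shape);
```

A non-target density vector can be built the same way from a different shape.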
Appendix C-4

CEG/RSD  Estimation  and  Confidence  Interval  Matlab  Instructions 
Two-Beta  Mixture  Target  and  Non-Target  Density  Model 


A. Provide a set of target samples, non-target samples, and confidence bound value. Then compute
the CEG curve and RSD value median estimates and confidence intervals based on these samples
(assumes a two-beta mixture model).

1. Place the following files into a common directory.

(For example: c:\matlab_sv12\work\)

beta_mean_w_a_b_2bc.m 

combine_beta_pdf_2bc.m 

conditioned_calc_2_2bc.m 

conf_error_new_w_return_2bc.m 

conf_error_new_weighted_2bc.m 

find_max_variance_2bc.m 

mixture_pdf_2bc.m 

rand_two_beta_density_2bc.m 

sample_gen_bimodal_2bc.m 

score_pts_to_hundredths_2bc.m 

two_beta_script_for_given_samples_2bc.m 

twobeta_ceg_truth_not_known_2bc.m 

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu,  if 
the  directory  is  not  already  in  the  path. 

3. Execute the following in Matlab. An example is contained in
‘two_beta_script_for_given_samples_2bc.m’.

Enter (or load) a vector of target scores into the variable ‘new_target_scores’.

Enter (or load) a vector of nontarget scores into the variable ‘new_nontarget_scores’.

Execute the following Matlab code to produce CEG curve and RSD value estimates with confidence intervals:

two_beta_ceg_truth_not_known(new_target_scores,new_nontarget_scores,10000,.95);

Replace  10000  by  the  desired  number  of  random  draws  (lower  numbers  of  draws  decrease 
computational  time).  An  approach  is  to  begin  with  a  low  number  of  draws  and  gradually  increase 
until  convergence  of  confidence  interval  solution  is  observed. 

Replace  .95  by  alternate  confidence  interval  coverage  if  desired  (e.g.  .90,  .80). 

To obtain the upper and lower confidence interval limit values for scores 0, .01, ..., .99, 1 (rather than an
on-screen plot), execute the following:

[ci_median, ci_upper, ci_lower, rsd_median, rsd_upper, rsd_lower] = ...
two_beta_ceg_truth_not_known(new_target_scores,new_nontarget_scores,10000,.95);

ci_median - CEG curve estimate

ci_upper - Upper CEG curve confidence interval contour

ci_lower - Lower CEG curve confidence interval contour

rsd_median - RSD value estimate

rsd_upper - Upper RSD value confidence interval estimate

rsd_lower - Lower RSD value confidence interval estimate


B. Generate many sets of samples for selected underlying target and nontarget densities, and then
obtain confidence intervals and estimates for the CEG curve / RSD value for each set of samples and
compute confidence interval accuracy (e.g. alpha) among all sets. This process assumes a two-beta
mixture model for target and non-target.

1. Place the following files into a common directory.

For example: c:\matlab_sv12\work\roc\

beta_mean_w_a_b_2bc.m 

combine_beta_pdf_2bc.m 

conditioned_calc_2_2bc.m 

conf_error_new_w_return_2bc.m 

conf_error_new_weighted_2bc.m 

find_max_variance_2bc.m 

mixture_pdf_2bc.m 

rand_two_beta_density_2bc.m 

sample_gen_bimodal_2bc.m 

score_pts_to_hundredths_2bc.m 

two_beta_script_for_many_runs_2bc.m 

twobeta_mn_nonempirical_2bc.m 

twobeta_unipdf_rsd_1000_2bc.m

2.  Add  the  common  directory  to  the  Matlab  path  by  using  ‘File  /  Set  Path’  option  in  Matlab  menu. 

3.  Open  the  File  ‘two_beta_script_for_many_runs_2bc.m’. 

Lines 4-23. Specify the number of target samples, the number of non-target samples, the five parameters of the assumed two-beta mixture target density, the five parameters of the assumed two-beta mixture non-target density, the number of runs (how many test runs are desired), and the prior probability of target. Also specify the number of random draws. This is a computational constraint: the number of draws selects how many grid points to evaluate for the target and non-target densities. An option is to begin at a number that executes quickly (e.g. 2000), then increase until convergence of the computed confidence intervals is observed.

Evaluate Lines 1 through 71. [Note: in Matlab this can be achieved by highlighting these lines, right-clicking to obtain a menu, and choosing ‘Evaluate Selection’.] After each run, the full set of results can be saved per Line 69.

CEG curve for a single run with confidence intervals:

a. Form a plot of the estimated CEG curve with confidence intervals [with the true CEG curve] by evaluating Lines 73-85. Note that run_number on Line 92 may be adjusted to any run among the set specified in step 3 above.

b. Obtain coverage for the full set of runs by evaluating Lines 94-173. The mean alpha for RSD over many runs is displayed at the top of the plot.
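
The coverage computation over the full set of runs can be pictured as a tally: a run counts as covered when the true value lies inside its confidence interval. The sketch below uses made-up interval limits and assumes that alpha denotes the fraction of runs whose interval misses the true value; the scripts may define it differently.

```matlab
% Hypothetical true metric value and per-run interval limits.
true_value  = 0.90;
lower_limit = [0.85 0.88 0.91 0.80 0.86];
upper_limit = [0.95 0.97 0.99 0.89 0.93];

% A run is covered when the true value falls inside its interval.
covered = (lower_limit <= true_value) & (true_value <= upper_limit);

coverage  = mean(covered);     % fraction of runs that cover the truth
alpha_hat = 1 - coverage;      % ASSUMED meaning of alpha: miss fraction
```

For a nominal 95% interval, a coverage near .95 (alpha near .05) over many runs would indicate accurate confidence intervals.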

